Abstract

We study the generalization performance of online learning algorithms trained on samples 
coming from a dependent source of data. We show that the generalization error of any stable 
online algorithm concentrates around its regret (an easily computable statistic of the online
performance of the algorithm) when the underlying ergodic process is $\beta$- or $\phi$-mixing. We show
high probability error bounds assuming the loss function is convex, and we also establish sharp 
convergence rates and deviation bounds for strongly convex losses and several linear prediction 
problems such as linear and logistic regression, least-squares SVM, and boosting on dependent 
data. In addition, our results have straightforward applications to stochastic optimization with 
dependent data, and our analysis requires only martingale convergence arguments; we need not 
rely on more powerful statistical tools such as empirical process theory. 

1 Introduction 

Online learning algorithms have the attractive property that regret guarantees (the performance of
the sequence of points $w(1), \ldots, w(n)$ the online algorithm plays, measured against a fixed predictor
$w^*$) hold for arbitrary sequences of loss functions, without assuming any statistical regularity of
the sequence. It is natural to ask whether one can say something stronger when some probabilistic 
structure underlies the sequence of examples, or loss functions, presented to the online algorithm. 
In particular, if the sequence of examples is generated by a stochastic process, can the online
learning algorithm output a good predictor for future samples from the same process? 

When data is drawn independently and identically distributed from a fixed underlying distribution,
Cesa-Bianchi et al. [7] have shown that online learning algorithms can in fact output predictors
with good generalization performance. Specifically, they show that for convex loss functions, the
average of the n predictors played by the online algorithm has, with high probability, small
generalization error on future examples generated i.i.d. from the same distribution. In this paper, we
ask the same question when the data is drawn according to a (dependent) ergodic process. 

In addition, this paper helps provide justification for the use of regret to a fixed comparator 
w* as a measure of performance for online learning algorithms. Regret to a fixed predictor is 
sometimes not a natural metric, which has led several researchers to study online algorithms with 
performance guarantees for (slowly) changing comparators $w^*(1), w^*(2), \ldots$ (see, e.g., Herbster and
Warmuth [13, 14]). When data comes i.i.d. from an (unknown) distribution, however, online-to-batch
conversions [7] justify computing regret with respect to a fixed $w^*$. In this paper, we show that
even when data comes from a dependent stochastic process, regret to a fixed comparator is both
meaningful and a reasonable evaluation metric. 

Though many practical settings require learning with non-i.i.d. data (examples include time
series data from financial problems, meteorological observations, and learning for predictive control),
the generalization performance of statistical learning algorithms for non-independent data is
perhaps not so well understood as that for the independent scenario. In spite of natural difficulties
encountered with dependent data, several researchers have studied the convergence of statistical
procedures in non-i.i.d. settings (see, e.g., [19, 22, 30]). In such scenarios, one generally assumes that
the data are drawn from a stationary $\alpha$-, $\beta$-, or $\phi$-mixing sequence, which implies that
dependence between observations weakens suitably over time. Yu [29] adapts classical empirical process
techniques to prove uniform laws of large numbers for dependent data; perhaps a more direct
parent to our approach is the work of Mohri and Rostamizadeh [22], who combine algorithmic
stability [H] with known concentration inequalities to derive generalization bounds. Steinwart and
Christmann [26] show fast rates of convergence for learning from stationary geometrically $\alpha$-mixing
processes, so long as the loss functions satisfy natural localization and self-bounding assumptions.
Such assumptions were previously exploited in the machine learning and statistics literature for
independent sequences (e.g. [2]), and Steinwart and Christmann extend these results by building
off Bernstein-type inequalities for dependent sequences due to Modha and Masry [21].

In this paper, we show that online learning algorithms enjoy guarantees on generalization to
unseen data for dependent data sequences from $\beta$- and $\phi$-mixing sources. In particular, we show
that stable online learning algorithms (those that do not change their predictor too aggressively
between iterations) also yield predictors with small generalization error. In the most favorable
regime of geometric mixing, we demonstrate generalization error on the order of $O(\log n/\sqrt{n})$ after
training on n samples when the loss function is convex and Lipschitz. We also demonstrate faster
$O(\log n/n)$ convergence when the loss function is strongly convex in the hypothesis w, which is
the usual case for regularized losses. In addition, we consider linear prediction settings, and show
$O(\log n/n)$ convergence when a loss that is strongly convex in its scalar argument (though not in the
predictor w) is applied to a linear predictor $\langle w, \cdot \rangle$, which gives fast rates for least squares SVMs,
least squares regression, logistic regression, and boosting over bounded sets. We also provide an
example and associated learning algorithm for which the expected regret goes to $-\infty$, while any
fixed predictor has expected loss zero; this shows that low regret alone is not sufficient to guarantee
small expected error when data samples are dependent.

In demonstrating generalization guarantees for online learning algorithms with dependent data,
we answer an open problem posed by Cesa-Bianchi et al. [7] on whether online algorithms give
good performance on unseen data when said data is drawn from a mixing stationary process. Our
results also answer a question posed by Xiao [28] regarding the convergence of the regularized dual
averaging algorithm with dependent stochastic gradients. More broadly, our results establish that
any suitably stable optimization or online learning algorithm converges in stochastic approximation
settings when the noise sequence is mixing. There is a rich history of classical work in this area
(see e.g. the book [3] and references therein), but most results for dependent data are asymptotic,
and to our knowledge there is a paucity of finite sample and high probability convergence
guarantees. The guarantees we provide have applications to, for example, learning from Markov chains,
autoregressive processes, or learning complex statistical models for which inference is expensive [27].

Our techniques build off of a recent paper by Duchi et al. [10], where we show high probability
bounds on the convergence of the mirror descent algorithm for stochastic optimization even when
the gradients are non-i.i.d. In particular, we build on our earlier martingale techniques, showing
concentration inequalities for dependent random variables that are sharper than previously used
Bernstein concentration for geometrically $\alpha$-mixing processes [21, 26] by exploiting recent ideas of
Kakade and Tewari [17], though we use weakened versions of $\phi$-mixing and $\beta$-mixing to prove our
high probability results. Further, our proof techniques require only relatively elementary martingale
convergence arguments, and we do not require that the input data is stationary but only that it is
suitably convergent.



2 Setup, Assumptions, and Notation 

We assume that the online algorithm receives n data points $x_1, \ldots, x_n$ from a sample space $\mathcal{X}$,
where the data is generated according to a stochastic process P, though the samples $x_t$ are not
necessarily i.i.d. or even independent. The online algorithm plays points (hypotheses) $w \in \mathcal{W}$, and
at iteration t the algorithm plays the point $w(t)$ and suffers the loss $F(w(t); x_t)$. We assume that
the statistical samples $x_t$ have a stationary distribution $\Pi$ to which they converge (we make this
precise shortly), and we measure generalization performance with respect to the expected loss or
risk functional

\[ f(w) := \mathbb{E}_\Pi[F(w; x)] = \int_{\mathcal{X}} F(w; x) \, d\Pi(x). \tag{1} \]

Essentially, our goal is to show that after n iterations of any low-regret online algorithm, it is possible
to use $w(1), \ldots, w(n)$ to output a predictor or hypothesis $\hat{w}_n$ for which $f(\hat{w}_n)$ is guaranteed to be
small with respect to any other hypothesis $w^*$.

Discussion of our statistical assumptions requires a few additional definitions. The total variation
distance between distributions P and Q defined on the probability space $(S, \mathcal{F})$, where $\mathcal{F}$ is a
$\sigma$-field, each with densities p and q with respect to an underlying measure $\mu$ (see footnote 1), is given by

\[ d_{TV}(P, Q) := \sup_{A \in \mathcal{F}} |P(A) - Q(A)| = \frac{1}{2} \int_S |p(s) - q(s)| \, d\mu(s). \tag{2} \]

Define the $\sigma$-field $\mathcal{F}_t = \sigma(x_1, \ldots, x_t)$. Let $P^t_{[s]}$ denote the distribution of $x_t$ conditioned on $\mathcal{F}_s$, that
is, given the initial samples $x_1, \ldots, x_s$. Written slightly differently, $P^t_{[s]} = P^t(\cdot \mid \mathcal{F}_s)$ is a version of
the conditional probability of $x_t$ given the sigma field $\mathcal{F}_s = \sigma(x_1, \ldots, x_s)$. Our main assumption
is that the stochastic process is suitably mixing: there is a stationary distribution $\Pi$ to which
the distribution of $x_t$ converges as t grows. We also assume that the distributions $P^t_{[s]}$ and $\Pi$ are
absolutely continuous with respect to an underlying measure $\mu$ throughout. We use the following
to measure convergence:

Definition 2.1 (Weak $\beta$- and $\phi$-mixing). The $\beta$- and $\phi$-mixing coefficients of the sampling
distribution P are defined, respectively, as

\[ \beta(k) := \sup_{t \in \mathbb{N}} \left\{ 2\, \mathbb{E}\left[ d_{TV}\big(P^{t+k}(\cdot \mid \mathcal{F}_t), \Pi\big) \right] \right\} \quad \text{and} \quad \phi(k) := \sup_{t \in \mathbb{N},\, B \in \mathcal{F}_t} \left\{ 2\, d_{TV}\big(P^{t+k}(\cdot \mid B), \Pi\big) \right\}. \]

We say that the process is $\phi$-mixing (respectively, $\beta$-mixing) if $\phi(k) \to 0$ ($\beta(k) \to 0$) as $k \to \infty$,
and we assume without loss that $\beta$ and $\phi$ are non-increasing. The above definitions are weaker than
the standard definitions of mixing [19, 29], which require mixing over the entire future $\sigma$-field of
the process, that is, $\sigma(x_t, x_{t+1}, x_{t+2}, \ldots)$. In contrast, we require mixing over only the single-slice
marginal of $x_{t+k}$. From the definition, we also see that $\beta$-mixing is weaker than $\phi$-mixing since
$\beta(k) \le \phi(k)$. We state our results in general forms using either the $\beta$- or $\phi$-mixing coefficients of the
stochastic process, and we generally use $\phi$-mixing results for stronger high-probability guarantees
compared to $\beta$-mixing. We remark that if the sequence $\{x_t\}$ is i.i.d., then $\phi(1) = \beta(1) = 0$.

Footnote 1: This assumption is without loss, since P and Q are each absolutely continuous with respect to the measure P + Q.

Two regimes of $\beta$-mixing (and $\phi$-mixing) will be of special interest. A process is called
geometrically $\beta$-mixing ($\phi$-mixing) if $\beta(k) \le \beta_0 \exp(-\beta_1 k^\theta)$ (respectively $\phi(k) \le \phi_0 \exp(-\phi_1 k^\theta)$) for
some $\beta_1, \phi_1, \theta > 0$. Some stochastic processes satisfying geometric mixing include finite-state
ergodic Markov chains and a large class of aperiodic, Harris-recurrent Markov processes; see the
references [20, 21] for more examples. A process is called algebraically $\beta$-mixing ($\phi$-mixing) if
$\beta(k) \le \beta_0 k^{-\theta}$ (resp. $\phi(k) \le \phi_0 k^{-\theta}$) for constants $\beta_0, \phi_0, \theta > 0$. Examples of algebraic mixing
arise in certain Metropolis-Hastings samplers when the proposal distribution does not have a lower
bounded density [3], some queuing systems, and other unbounded processes.
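
As a rough illustration of geometric mixing, the sketch below computes the single-slice coefficients of Definition 2.1 for a small finite-state Markov chain. It assumes a time-homogeneous chain in which every state has positive probability, so that the supremum defining $\phi(k)$ is attained at a single current state, and it evaluates $\beta(k)$ only for a chain started from stationarity. The transition matrix and the function names are hypothetical choices for the example, not objects from the paper.

```python
import numpy as np

def stationary_distribution(P):
    """Return the row vector pi with pi @ P = pi (leading left eigenvector of P)."""
    eigvals, eigvecs = np.linalg.eig(P.T)
    pi = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
    return pi / pi.sum()

def phi_coefficient(P, k):
    """phi(k) for a homogeneous chain: max over states s of 2 * d_TV(P^k(s, .), pi)."""
    pi = stationary_distribution(P)
    Pk = np.linalg.matrix_power(P, k)
    # 2 * d_TV(p, q) is the L1 distance between the probability vectors p and q.
    return np.abs(Pk - pi).sum(axis=1).max()

def beta_coefficient(P, k):
    """beta(k) when x_t ~ pi: average over s ~ pi of 2 * d_TV(P^k(s, .), pi)."""
    pi = stationary_distribution(P)
    Pk = np.linalg.matrix_power(P, k)
    return float(pi @ np.abs(Pk - pi).sum(axis=1))

# Hypothetical two-state chain; its second eigenvalue is 0.7.
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
phis = [phi_coefficient(P, k) for k in range(1, 6)]
betas = [beta_coefficient(P, k) for k in range(1, 6)]
```

For this chain both sequences shrink by the factor 0.7 at every step, which is geometric mixing with $\theta = 1$, and each $\beta(k)$ sits below the corresponding $\phi(k)$.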

We now turn to stating the relevant assumptions on the instantaneous loss functions $F(\cdot; x)$ and
other quantities relevant to the online learning algorithm. Recall that the algorithm plays points
(hypotheses) $w \in \mathcal{W}$. Throughout, we make the following boundedness assumptions on F and the
domain $\mathcal{W}$, which are common in the online learning literature.

Assumption A (Boundedness). For $\mu$-almost every (henceforth $\mu$-a.e.) x, the function $F(\cdot; x)$ is
convex and G-Lipschitz with respect to a norm $\|\cdot\|$ over $\mathcal{W}$:

\[ |F(w; x) - F(v; x)| \le G \|w - v\| \tag{3} \]

for all $w, v \in \mathcal{W}$. In addition, $\mathcal{W}$ is compact and has finite radius: for any $w, w^* \in \mathcal{W}$,

\[ \|w - w^*\| \le R. \tag{4} \]

Further, $F(w; x) \in [0, GR]$.

As a consequence of Assumption A, f is also G-Lipschitz. Given the first two bounds (3) and (4)
of Assumption A, the final condition can be assumed without loss; we make it explicit to avoid
centering issues later. In the sequel, we give somewhat stronger results in the presence of the
following additional assumption, which lower bounds the curvature of the expected function f:

Assumption B (Strong convexity). The expected function f is $\lambda$-strongly convex with respect to
the norm $\|\cdot\|$, that is,

\[ f(v) \ge f(w) + \langle g, v - w \rangle + \frac{\lambda}{2} \|w - v\|^2 \quad \text{for } w, v \in \mathcal{W} \text{ and for all } g \in \partial f(w). \tag{5} \]

Lastly, to prove generalization error bounds for online learning algorithms, we require them to
be appropriately stable, as described in the next assumption.

Assumption C. There is a non-increasing sequence $\kappa(t)$ such that if $w(t)$ and $w(t + 1)$ are
successive iterates of the online algorithm, then $\|w(t) - w(t + 1)\| \le \kappa(t)$.

Here $\|\cdot\|$ is the same norm as that used in Assumption A. We observe that this stability assumption
is different from the stability condition of Mohri and Rostamizadeh [22] and neither one implies the
other. It is common (or at least straightforward) to establish bounds $\kappa(t)$ as a part of the regret
analysis of online algorithms (e.g. [28]), which motivates our assumption here.
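
To make the stability condition concrete, here is a minimal sketch of projected online subgradient descent on the absolute loss, one common algorithm for which a bound of this kind is easy to read off. The loss, the Euclidean norm, the step size $\eta/\sqrt{t}$ and all names are assumptions made for the illustration. Since the projection is non-expansive, each step satisfies $\|w(t) - w(t+1)\| \le \eta G/\sqrt{t}$, so $\kappa(t) = \eta G/\sqrt{t}$ is a valid non-increasing stability sequence for this sketch, whatever the data sequence.

```python
import numpy as np

def project_onto_ball(w, radius):
    """Euclidean projection onto {w : ||w||_2 <= radius}."""
    norm = np.linalg.norm(w)
    return w if norm <= radius else w * (radius / norm)

def ogd_iterates(features, targets, radius, eta):
    """Projected online subgradient descent on F(w; (a, b)) = |<w, a> - b|."""
    n, d = features.shape
    w = np.zeros(d)
    iterates = [w.copy()]
    for t in range(1, n + 1):
        a, b = features[t - 1], targets[t - 1]
        g = np.sign(w @ a - b) * a                 # subgradient of the absolute loss
        w = project_onto_ball(w - (eta / np.sqrt(t)) * g, radius)
        iterates.append(w.copy())
    return np.array(iterates)                      # rows are w(1), ..., w(n + 1)

rng = np.random.default_rng(0)
features = rng.normal(size=(200, 3))
# Rescale so every feature vector has norm at most 1, which gives G = 1.
features /= np.maximum(1.0, np.linalg.norm(features, axis=1, keepdims=True))
targets = rng.normal(size=200)

eta, G = 0.5, 1.0
iterates = ogd_iterates(features, targets, radius=1.0, eta=eta)
steps = np.linalg.norm(np.diff(iterates, axis=0), axis=1)   # ||w(t) - w(t + 1)||
kappa = eta * G / np.sqrt(np.arange(1, len(steps) + 1))     # candidate kappa(t)
```

Comparing `steps` against `kappa` entry by entry checks the stability bound along the whole run.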

What remains to complete our setup is to quantify our assumptions on the performance of the
online learning algorithm. We assume access to an online algorithm whose regret is bounded by
(the possibly random quantity) $\mathfrak{R}_n$ for the sequence of points $x_1, \ldots, x_n \in \mathcal{X}$, that is, the online
algorithm produces a sequence of iterates $w(1), \ldots, w(n)$ such that for any fixed $w^* \in \mathcal{W}$,

\[ \sum_{t=1}^{n} \left[ F(w(t); x_t) - F(w^*; x_t) \right] \le \mathfrak{R}_n. \tag{6} \]

Our goal is to use the sequence $w(1), \ldots, w(n)$ to construct an estimator $\hat{w}_n$ that performs well
on unseen data. Since our samples are dependent, we measure the generalization error on future
test samples drawn from the same sample path as the training data [22]. That is, we measure
performance on the m samples $x_{n+1}, \ldots, x_{n+m}$ drawn from the process P, and we would like to
bound the future risk of $\hat{w}_n$, defined as

\[ \frac{1}{m} \sum_{t=1}^{m} \mathbb{E}\left[ F(\hat{w}_n; x_{n+t}) - F(w^*; x_{n+t}) \mid \mathcal{F}_n \right], \tag{7} \]

the conditional expectation of the losses $F(\hat{w}_n; x)$ given the first n samples. Note that in the i.i.d.
setting [7], the expectation above is the excess risk $f(\hat{w}_n) - f(w^*)$ of $\hat{w}_n$ against $w^*$, because $x_{n+t}$
is independent of $x_1, \ldots, x_n$. Of course, we are in the dependent setting, so the generalization
measure (7) requires slightly more care.



3 Generalization bounds for convex functions 

Our definitions and assumptions in place, we show in this section that any suitably stable online
learning algorithm enjoys a high-probability generalization guarantee for convex loss functions F.
The main results of this section are Theorems 2 and 3, which give high probability convergence of
any stable online learning algorithm under $\phi$- and $\beta$-mixing, respectively. Following Theorem 2,
we also present an example illustrating that low regret is by itself insufficient to guarantee good
generalization performance, which is distinct from i.i.d. settings [7].

Before proceeding with our technical development, we describe the high-level structure and
intuition underlying our proofs. The technical insight underpinning many of our results is that
under our mixing assumptions, the distribution of the random instance $x_{t+\tau}$ is close to the stationary
distribution conditioned on $\mathcal{F}_t$. That is, looking some number of steps $\tau$ into the future from a time t
is almost as good as obtaining an unbiased sample from the stationary distribution $\Pi$. As a result,
the loss $F(w(t); x_{t+\tau})$ is a good proxy for $f(w(t))$, since $w(t)$ only depends on $x_1, \ldots, x_{t-1}$. Lemma 1
formalizes this intuition. (Duchi et al. [10] use a similar technique as a building block.) Under our
stability condition, we can further demonstrate that $F(w(t); x_{t+\tau})$ is close to $F(w(t + \tau); x_{t+\tau})$,
and the behavior of the latter sequence is nearly the same as the sequence $F(w(t); x_t)$ with respect
to which the regret $\mathfrak{R}_n$ is measured. We make these ideas formal in Propositions 1 and 2.
We then combine our intermediate results (including bounds on the regret $\mathfrak{R}_n$), applying relevant
martingale concentration inequalities, to obtain the main theorems of this and later sections.

Our starting point is the above-mentioned technical lemma that underlies many of our results. 
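Lemma 1. Let $w, v \in \mathcal{W}$ be measurable with respect to the $\sigma$-field $\mathcal{F}_t$ and let Assumption A hold.
Then for any $\tau \in \mathbb{N}$,

\[ \mathbb{E}\left[ F(w; x_{t+\tau}) - F(v; x_{t+\tau}) \mid \mathcal{F}_t \right] \le f(w) - f(v) + GR\,\phi(\tau) \]

and

\[ \mathbb{E}\left[ \left| \mathbb{E}\left[ F(w; x_{t+\tau}) - F(v; x_{t+\tau}) \mid \mathcal{F}_t \right] - (f(w) - f(v)) \right| \right] \le GR\,\beta(\tau). \]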

Proof. We first prove the result for the $\phi$-mixing bound. Recalling that $f(w) = \mathbb{E}_\Pi[F(w; x)]$ and
the definition of the underlying measure $\mu$ and the densities $\pi$ and p,

\[
\begin{aligned}
\mathbb{E}[F(w; x_{t+\tau}) - F(v; x_{t+\tau}) \mid \mathcal{F}_t]
&= \mathbb{E}[F(w; x_{t+\tau}) - f(w) + f(v) - F(v; x_{t+\tau}) \mid \mathcal{F}_t] + f(w) - f(v) \\
&= \int_{\mathcal{X}} [F(w; x) - F(v; x)] \big( p^{t+\tau}_{[t]}(x) - \pi(x) \big) \, d\mu(x) + f(w) - f(v) \\
&\le \int_{\mathcal{X}} |F(w; x) - F(v; x)| \, \big| p^{t+\tau}_{[t]}(x) - \pi(x) \big| \, d\mu(x) + f(w) - f(v) \\
&\le GR \int_{\mathcal{X}} \big| p^{t+\tau}_{[t]}(x) - \pi(x) \big| \, d\mu(x) + f(w) - f(v) \\
&= 2GR \cdot d_{TV}\big(P^{t+\tau}_{[t]}, \Pi\big) + f(w) - f(v),
\end{aligned}
\]

where for the second inequality we used the Lipschitz assumption A and the compactness
assumption on $\mathcal{W}$. Noting that $2 d_{TV}(P^{t+\tau}_{[t]}, \Pi) \le \phi(\tau)$ by Definition 2.1 completes the proof of the first
part.

To see the second inequality using $\beta$-mixing coefficients, we begin by noting that as a
consequence of the proof of the first inequality,

\[ \mathbb{E}[F(w; x_{t+\tau}) - F(v; x_{t+\tau}) \mid \mathcal{F}_t] - (f(w) - f(v)) \le 2GR \, d_{TV}\big(P^{t+\tau}_{[t]}, \Pi\big), \]

and the inequality holds with w and v switched:

\[ \mathbb{E}[F(v; x_{t+\tau}) - F(w; x_{t+\tau}) \mid \mathcal{F}_t] - (f(v) - f(w)) \le 2GR \, d_{TV}\big(P^{t+\tau}_{[t]}, \Pi\big). \]

Combining the two inequalities and taking expectations, we have

\[ \mathbb{E}\left[ \left| \mathbb{E}[F(w; x_{t+\tau}) - F(v; x_{t+\tau}) \mid \mathcal{F}_t] - (f(w) - f(v)) \right| \right] \le 2GR \, \mathbb{E}\left[ d_{TV}\big(P^{t+\tau}(\cdot \mid \mathcal{F}_t), \Pi\big) \right] \le GR\,\beta(\tau) \]

by Definition 2.1 of the mixing coefficients. □
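
The first bound of Lemma 1 can be checked numerically on a toy problem. The sketch below assumes a three-state homogeneous Markov chain (so conditioning on $\mathcal{F}_t$ reduces to conditioning on the current state) and the loss $F(w; x) = |w - x|$ on $\mathcal{W} = [0, 1]$, for which $G = R = 1$ and $F \in [0, GR]$. The chain, the loss and the function names are hypothetical choices for the illustration.

```python
import numpy as np

states = np.array([0.0, 0.5, 1.0])            # sample space X (hypothetical)
P = np.array([[0.6, 0.3, 0.1],
              [0.2, 0.6, 0.2],
              [0.1, 0.3, 0.6]])               # transition matrix (hypothetical)
G, R = 1.0, 1.0                               # constants for F(w; x) = |w - x| on [0, 1]

def loss(w):
    """Vector of losses F(w; x) over all states x."""
    return np.abs(w - states)

# Stationary distribution: left eigenvector of P for eigenvalue 1.
eigvals, eigvecs = np.linalg.eig(P.T)
pi = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
pi /= pi.sum()

def lemma1_gap(w, v, tau):
    """Max over current states of |E[F(w; x') - F(v; x') | state] - (f(w) - f(v))|."""
    diff = loss(w) - loss(v)
    P_tau = np.linalg.matrix_power(P, tau)
    return np.abs(P_tau @ diff - pi @ diff).max()

def phi(tau):
    """Single-slice phi-mixing coefficient of the chain at lag tau."""
    P_tau = np.linalg.matrix_power(P, tau)
    return np.abs(P_tau - pi).sum(axis=1).max()

grid = np.linspace(0.0, 1.0, 11)
worst_slack = min(G * R * phi(tau) - lemma1_gap(w, v, tau)
                  for tau in range(1, 6) for w in grid for v in grid)
```

A non-negative `worst_slack` means the gap never exceeded $GR\,\phi(\tau)$ on the grid of predictors and lags tried here.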



Using Lemma 1, we can give a proposition that relates the risk on the test sequence to the
expected error of a predictor w under the stationary distribution. The result shows that for any w
measurable with respect to the $\sigma$-field $\mathcal{F}_n$ (we use $\hat{w}_n \in \mathcal{F}_n$, the (unspecified as yet) output of the
online learning algorithm) we can prove generalization bounds by showing that w has small risk
under the stationary distribution $\Pi$.

Proposition 1. Under the Lipschitz assumption A, for any $w \in \mathcal{W}$ measurable with respect to $\mathcal{F}_n$,
any $w^* \in \mathcal{W}$, and any $\tau \in \mathbb{N}$,

\[ \frac{1}{m} \sum_{t=n+1}^{n+m} \mathbb{E}[F(w; x_t) - F(w^*; x_t) \mid \mathcal{F}_n] \le f(w) - f(w^*) + \phi(\tau) GR + \frac{(\tau - 1) GR}{m} \]

and

\[ \mathbb{E}\left[ \frac{1}{m} \sum_{t=n+1}^{n+m} \mathbb{E}[F(w; x_t) - F(w^*; x_t) \mid \mathcal{F}_n] \right] \le \mathbb{E}[f(w)] - f(w^*) + \beta(\tau) GR + \frac{(\tau - 1) GR}{m}. \]



Proof. The proof follows from Definition 2.1 of mixing. The key idea is to give up on the first
$\tau - 1$ test samples and use the mixing assumption to control the loss on the remainder. We have

\[
\begin{aligned}
\sum_{t=n+1}^{n+m} \mathbb{E}[F(w; x_t) - F(w^*; x_t) \mid \mathcal{F}_n]
&= \sum_{t=n+1}^{n+\tau-1} \mathbb{E}[F(w; x_t) - F(w^*; x_t) \mid \mathcal{F}_n] + \sum_{t=n+\tau}^{n+m} \mathbb{E}[F(w; x_t) - F(w^*; x_t) \mid \mathcal{F}_n] \\
&\le (\tau - 1) GR + \sum_{t=n+\tau}^{n+m} \mathbb{E}[F(w; x_t) - F(w^*; x_t) \mid \mathcal{F}_n],
\end{aligned}
\]

since by the Lipschitz assumption A and compactness $F(w; x) - F(w^*; x) \le GR$. Now, we apply
Lemma 1 to the summation, which completes the proof. □



Proposition 1 allows us to focus on controlling the error on the expected function f under the
stationary distribution $\Pi$, which is a natural convergence guarantee. Indeed, the function f is
the risk functional with respect to which convergence is measured in the standard i.i.d. case, and
applying Proposition 1 with $\tau = 1$ and $\phi(1) = 0$ (or $\beta(1) = 0$) confirms that the bound is equal to
$f(w) - f(w^*)$. We now turn to controlling the error under f, beginning with a result that relates risk
performance of the sequence of hypotheses $w(1), \ldots, w(n)$ output by the online learning algorithm
to the algorithm's regret, a term dependent on the stability of the algorithm, and an additional
random term. This proposition is the starting point for the remainder of our results in this section.

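Proposition 2. Let Assumptions A and C hold and let $w(t)$ denote the sequence of outputs of the
online algorithm. Then for any $\tau \in \mathbb{N}$,

\[ \sum_{t=1}^{n} \left[ f(w(t)) - f(w^*) \right] \le \mathfrak{R}_n + G\tau \sum_{t=1}^{n} \kappa(t) + 2\tau GR + \sum_{t=1}^{n} \left[ f(w(t)) - F(w(t); x_{t+\tau}) + F(w^*; x_{t+\tau}) - f(w^*) \right]. \tag{8} \]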

Proof. We begin by expanding the regret of $w(t)$ on the sequence f via

\[
\begin{aligned}
\sum_{t=1}^{n} [f(w(t)) - f(w^*)]
&= \sum_{t=1}^{n} \left[ f(w(t)) - F(w(t); x_{t+\tau}) + F(w(t); x_{t+\tau}) - f(w^*) \right] \\
&= \sum_{t=1}^{n} \left[ f(w(t)) - F(w(t); x_{t+\tau}) + F(w^*; x_{t+\tau}) - f(w^*) + F(w(t); x_{t+\tau}) - F(w^*; x_{t+\tau}) \right].
\end{aligned} \tag{9}
\]

Now we use stability and the regret guarantee (6) to bound the last two terms of the summation (9).
To that end, note that

\[ \sum_{t=1}^{n} [F(w(t); x_{t+\tau}) - F(w^*; x_{t+\tau})] = S_1 + S_2 + S_3, \]

where

\[
\begin{aligned}
S_1 &:= \sum_{t=1}^{n} [F(w(t); x_t) - F(w^*; x_t)], \qquad
S_2 := \sum_{t=1}^{n-\tau} [F(w(t); x_{t+\tau}) - F(w(t + \tau); x_{t+\tau})], \\
S_3 &:= \sum_{t=n-\tau+1}^{n} F(w(t); x_{t+\tau}) - \sum_{t=1}^{\tau} F(w(t); x_t) + \sum_{t=1}^{\tau} F(w^*; x_t) - \sum_{t=n+1}^{n+\tau} F(w^*; x_t).
\end{aligned}
\]

We now bound the three terms in the summation. $S_3$ is bounded by $2\tau GR$ under the boundedness
assumption A, and the regret bound (6) guarantees that $S_1 \le \mathfrak{R}_n$. Using the stability assumption C,
we can bound $S_2$ by noting

\[ F(w(t); x_{t+\tau}) - F(w(t + \tau); x_{t+\tau}) \le G \|w(t) - w(t + \tau)\| \le G \sum_{s=0}^{\tau-1} \kappa(t + s) \le G\tau\,\kappa(t), \]

where the last step uses the non-increasing property of the coefficients $\kappa(t)$. Substituting the bounds
on $S_1$, $S_2$, and $S_3$ into Eq. (9) completes the proposition. □



The remaining development of this section consists of using the key inequality (8) in
Proposition 2 to give expected and high-probability convergence guarantees for the online learning
algorithm. Throughout, we define the output of the online algorithm to be the averaged predictor

\[ \hat{w}_n = \frac{1}{n} \sum_{t=1}^{n} w(t). \tag{10} \]
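
The averaged predictor can be maintained incrementally rather than by storing all n iterates. The sketch below is one way to do this; the function name and the toy iterates are hypothetical, and the quadratic `f` is only a stand-in convex risk used to illustrate the Jensen step $f(\hat{w}_n) \le \frac{1}{n} \sum_t f(w(t))$ that the proofs rely on.

```python
import numpy as np

def averaged_predictor(iterates):
    """Running form of (1/n) * sum_t w(t); keeps one vector in memory."""
    avg = np.zeros_like(np.asarray(iterates[0], dtype=float))
    for t, w in enumerate(iterates, start=1):
        avg += (np.asarray(w, dtype=float) - avg) / t   # avg_t = avg_{t-1} + (w(t) - avg_{t-1}) / t
    return avg

iterates = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([2.0, 2.0])]
w_hat = averaged_predictor(iterates)

f = lambda w: float(w @ w)        # hypothetical convex risk, for the Jensen check only
mean_risk = sum(f(w) for w in iterates) / len(iterates)
```

Here `f(w_hat)` is no larger than `mean_risk`, as convexity of `f` requires.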

We begin with results giving convergence in expectation for stable online algorithms. 

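Theorem 1. Under Assumptions A and C, for any $\tau \in \mathbb{N}$ the predictor $\hat{w}_n$ satisfies the guarantee

\[ \mathbb{E}[f(\hat{w}_n)] - f(w^*) \le \frac{1}{n} \mathbb{E}[\mathfrak{R}_n] + \beta(\tau) GR + \frac{(\tau - 1) G}{n} \left( 2R + \sum_{t=1}^{n} \kappa(t) \right), \]

for any $w^* \in \mathcal{W}$.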

Proof. From the inequality (8) in Proposition 2, what remains is to take the expectation of the
random quantities. To that end, we note that $w(t)$ is measurable with respect to $\mathcal{F}_{t-1}$ (since the
iterate at time t depends only on the first $t - 1$ samples) and apply Lemma 1, which gives

\[ \mathbb{E}\left[ \mathbb{E}[F(w^*; x_{t+\tau-1}) - F(w(t); x_{t+\tau-1}) \mid \mathcal{F}_{t-1}] \right] \le \mathbb{E}[f(w^*) - f(w(t))] + GR\,\beta(\tau). \]

Adding the difference to the sum (8) with the setting $\tau \mapsto (\tau - 1)$ gives

\[ \mathbb{E}\left[ \sum_{t=1}^{n} f(w(t)) - f(w^*) \right] \le \mathbb{E}[\mathfrak{R}_n] + G(\tau - 1) \sum_{t=1}^{n} \kappa(t) + 2(\tau - 1) GR + n GR\,\beta(\tau). \]

Dividing by n and observing that $f(\hat{w}_n) \le \frac{1}{n} \sum_{t=1}^{n} f(w(t))$ by Jensen's inequality completes the
proof. □



We observe that setting $\tau = 1$ and $\beta(1) = 0$ recovers an expected version of the results of
Cesa-Bianchi et al. [7, Corollary 2] for i.i.d. samples. Theorem 1 combined with Proposition 1 immediately
yields the following generalization bound. Our other results can be similarly extended, but we leave
such development to the reader.

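Corollary 2. Under Assumptions A and C, for any $\tau \in \mathbb{N}$ the predictor $\hat{w}_n$ satisfies the guarantee

\[ \mathbb{E}\left[ \frac{1}{m} \sum_{t=n+1}^{n+m} F(\hat{w}_n; x_t) - F(w^*; x_t) \right] \le \frac{1}{n} \mathbb{E}[\mathfrak{R}_n] + 2\beta(\tau) GR + (\tau - 1) GR \left( \frac{2}{n} + \frac{1}{m} + \frac{1}{nR} \sum_{t=1}^{n} \kappa(t) \right). \]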



It is clear that the stability assumption we make on the online algorithm plays a key role in our
results whenever $\tau > 1$, that is, the samples are indeed dependent. It is natural to ask whether this
additional term is just an artifact of our analysis, or whether low regret by itself ensures a small
error under the stationary distribution even for dependent data. The next example shows that low
regret, by itself, is insufficient for generalization guarantees, so some additional assumption on
the online algorithm is necessary to guarantee small error under the stationary distribution.



Example 1 (Low-regret does not imply convergence). In 1-dimension, define the linear loss
$F(w; x) = \langle w, x \rangle$, where $x \in \{-1, 1\}$ and the set $\mathcal{W} = [-1, 1]$. Let $p > 0$ and define the following
dependent sampling process: at each time t, set

\[ x_t = \begin{cases} 1 & \text{with probability } p/2 \\ -1 & \text{with probability } p/2 \\ x_{t-1} & \text{with probability } 1 - p. \end{cases} \]

The stationary distribution $\Pi$ is uniform on $\{-1, 1\}$, so the expected error $\mathbb{E}_\Pi[\langle w, x \rangle] = 0$ for any
$w \in \mathcal{W}$. However, we can demonstrate an update rule with negative expected regret as follows.
Consider the algorithm which sets $w(t) = -x_{t-1}$, implementing a trivial so-called follow the leader
strategy. With probability $1 - p/2$, the value $\langle w(t), x_t \rangle = -1$, while $\langle w(t), x_t \rangle = 1$ with probability
$p/2$. Consequently, the expectation of the cumulative sum $\sum_{t=1}^{n} F(w(t); x_t)$ is $-(1 - p)n$. Using
standard results on the expected deviation of the simple random walk (e.g. [4]), we know that

\[ \mathbb{E}\left[ \inf_{w \in \mathcal{W}} \sum_{t=1}^{n} \langle w, x_t \rangle \right] = -\mathbb{E}\left[ \left| \sum_{t=1}^{n} x_t \right| \right] = -\Theta(\sqrt{n}). \]

We are thus guaranteed that the expected regret of the update rule is $-\Omega((1 - p)n)$.
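
A short simulation makes the example tangible. The sketch below draws one sample path of the process, plays $w(t) = -x_{t-1}$, and compares the cumulative loss with that of the best fixed predictor in hindsight. The function name, the arbitrary starting value $x_0$, and the values of n, p and the seed are assumptions made for the illustration.

```python
import numpy as np

def simulate_example(n, p, seed=0):
    """Simulate Example 1 and return (cumulative loss, best fixed cumulative loss)."""
    rng = np.random.default_rng(seed)
    x = np.empty(n + 1)
    x[0] = rng.choice([-1.0, 1.0])          # x_0 only starts the chain
    for t in range(1, n + 1):
        u = rng.random()
        if u < p / 2:
            x[t] = 1.0
        elif u < p:
            x[t] = -1.0
        else:
            x[t] = x[t - 1]                 # repeat the previous sample
    w = -x[:-1]                             # w(t) = -x_{t-1}
    losses = w * x[1:]                      # F(w(t); x_t) = w(t) * x_t
    best_fixed = -abs(x[1:].sum())          # inf over w in [-1, 1] of w * sum_t x_t
    return losses.sum(), best_fixed

n, p = 20_000, 0.2
cumulative_loss, best_fixed = simulate_example(n, p)
regret = cumulative_loss - best_fixed
average_loss = cumulative_loss / n
```

With these settings the average loss should be close to $-(1 - p)$ and the regret strongly negative, even though every fixed predictor has expected loss zero under $\Pi$.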
Figure 1. The $\tau$ different blocks of near-martingales used in the proof of Theorem 2. Black boxes
represent elements in the same index set $\mathcal{I}(1)$, gray in $\mathcal{I}(2)$, and so on.



We have now seen that it is possible to achieve guarantees on the generalization properties of
an online learning algorithm by taking expectation over both the training and test samples. We
would like to prove stronger results that hold with high probability over the training data, as is
possible in i.i.d. settings [7]. The next theorem applies martingale concentration arguments using
the Hoeffding-Azuma inequality [H] to give high-probability concentration for the random quantities
remaining in the bound of Proposition 2.

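Theorem 2. Under Assumptions A and C, with probability at least $1 - \delta$, for any $\tau \in \mathbb{N}$ and any
$w^* \in \mathcal{W}$ the predictor $\hat{w}_n$ satisfies the guarantee

\[ f(\hat{w}_n) - f(w^*) \le \frac{1}{n} \mathfrak{R}_n + \frac{(\tau - 1) G}{n} \sum_{t=1}^{n} \kappa(t) + 2GR \sqrt{\frac{2\tau}{n} \log \frac{\tau}{\delta}} + \phi(\tau) GR + \frac{2(\tau - 1) GR}{n}. \]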

Proof. Inspecting the inequality (8) from Proposition 2, we observe that it suffices to bound

\[ Z_n := \sum_{t=1}^{n} \left[ f(w(t)) - f(w^*) - F(w(t); x_{t+\tau-1}) + F(w^*; x_{t+\tau-1}) \right]. \tag{11} \]

This is analogous to the term that arises in the i.i.d. case [7], where $Z_n$ is a bounded martingale
sequence and hence concentrates around its expectation. Our proof that the sum (11) concentrates
is similar to the argument Duchi et al. [10] use to prove concentration for the ergodic mirror descent
algorithm. The idea is that though $Z_n$ is not quite a martingale in the general ergodic case, it is in
fact a sum of $\tau$ near-martingales. This technique of using blocks of random variables in dependent
settings has also been used in previous work to directly bound the moment generating function
of sums of dependent variables [21], though our approach is different. See Fig. 1 for a graphical
representation of our choice (12) of the martingale sequences.

For $i \in \{1, \ldots, \tau\}$ and $t \in \{1, \ldots, \lceil n/\tau \rceil\}$, define the random variables

\[ X^i_t := f(w((t - 1)\tau + i)) - f(w^*) + F(w^*; x_{t\tau+i-1}) - F(w((t - 1)\tau + i); x_{t\tau+i-1}). \tag{12} \]

In addition, define the associated $\sigma$-fields $\mathcal{F}^i_t := \mathcal{F}_{t\tau+i-1} = \sigma(x_1, \ldots, x_{t\tau+i-1})$. Then it is clear that
$X^i_t$ is measurable with respect to $\mathcal{F}^i_t$ (recall that $w(t)$ is measurable with respect to $\mathcal{F}_{t-1}$), so the
sequence $X^i_t - \mathbb{E}[X^i_t \mid \mathcal{F}^i_{t-1}]$ defines a martingale difference sequence adapted to the filtration $\mathcal{F}^i_t$,
$t = 1, 2, \ldots$. Following previous subsampling techniques [21, 10], we define the index set $\mathcal{I}(i)$ to
be the indices $\{1, \ldots, \lfloor n/\tau \rfloor + 1\}$ for $i \le n - \tau \lfloor n/\tau \rfloor$ and $\{1, \ldots, \lfloor n/\tau \rfloor\}$ otherwise. Then a bit of
algebra shows that

\[ Z_n = \sum_{i=1}^{\tau} \sum_{t \in \mathcal{I}(i)} \left[ X^i_t - \mathbb{E}[X^i_t \mid \mathcal{F}^i_{t-1}] \right] + \sum_{i=1}^{\tau} \sum_{t \in \mathcal{I}(i)} \mathbb{E}[X^i_t \mid \mathcal{F}^i_{t-1}]. \tag{13} \]

The first term in the decomposition (13) is a sum of $\tau$ different martingale difference sequences.
In addition, the boundedness assumption A guarantees that $|X^i_t - \mathbb{E}[X^i_t \mid \mathcal{F}^i_{t-1}]| \le 2GR$, so each of
the sequences is a bounded difference sequence. The Hoeffding-Azuma inequality [H] then guarantees

\[ P\left( \sum_{t \in \mathcal{I}(i)} \left[ X^i_t - \mathbb{E}[X^i_t \mid \mathcal{F}^i_{t-1}] \right] > \gamma \right) \le \exp\left( -\frac{\tau \gamma^2}{8 n G^2 R^2} \right). \tag{14} \]

To control the expectation term from the second sum in the representation (13), we use mixing.
Indeed, Lemma 1 immediately implies that $\mathbb{E}[X^i_t \mid \mathcal{F}^i_{t-1}] \le GR\,\phi(\tau)$. Combining these bounds with
the application (14) of the Hoeffding-Azuma inequality, we see by a union bound that

\[ P\left( Z_n > n GR\,\phi(\tau) + \gamma \right) \le \sum_{i=1}^{\tau} P\left( \sum_{t \in \mathcal{I}(i)} \left[ X^i_t - \mathbb{E}[X^i_t \mid \mathcal{F}^i_{t-1}] \right] > \frac{\gamma}{\tau} \right) \le \tau \exp\left( -\frac{\gamma^2}{8 \tau n G^2 R^2} \right). \]

Equivalently, by setting $\gamma = 2GR \sqrt{2 n \tau \log(\tau/\delta)}$, we obtain that with probability at least $1 - \delta$,

\[ Z_n \le GR \left[ n \phi(\tau) + 2 \sqrt{2 n \tau \log \frac{\tau}{\delta}} \right]. \]

Dividing by n and using the convexity of f as in the proof of Theorem 1 completes the proof. □
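
The blocking construction in the proof can be sketched in a few lines. The code below splits the time indices $1, \ldots, n$ into $\tau$ interleaved blocks, where block i collects the indices $(t - 1)\tau + i$ for t in the index set $\mathcal{I}(i)$ described above. Consecutive members of one block are $\tau$ steps apart, which is what lets mixing control each block separately. The function name is a hypothetical choice for the illustration.

```python
def block_indices(n, tau):
    """Split time indices 1..n into tau interleaved blocks of spacing tau."""
    blocks = []
    remainder = n - tau * (n // tau)
    for i in range(1, tau + 1):
        # I(i) = {1, ..., floor(n / tau) + 1} if i <= remainder, else {1, ..., floor(n / tau)}.
        size = n // tau + 1 if i <= remainder else n // tau
        blocks.append([(t - 1) * tau + i for t in range(1, size + 1)])
    return blocks

blocks = block_indices(n=10, tau=3)
```

For `n=10` and `tau=3` this gives the blocks `[1, 4, 7, 10]`, `[2, 5, 8]` and `[3, 6, 9]`, which together cover every index exactly once.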



To better illustrate our results, we now specialize them under concrete mixing assumptions in
several corollaries, which should make clearer the rates of convergence of the procedures. We begin
with two corollaries giving generalization error bounds for geometrically and algebraically $\phi$-mixing
processes (defined in Section 2).

Corollary 3. Under the assumptions of Theorem 2, assume further that $\phi(k) \le \phi_0 \exp(-\phi_1 k^\theta)$ for
some universal constant $\phi_0$. There exists a finite universal constant C such that with probability at
least $1 - \delta$, for any $w^* \in \mathcal{W}$

\[ f(\hat{w}_n) - f(w^*) \le \frac{1}{n} \mathfrak{R}_n + C \cdot \left[ \frac{(\log n)^{1/\theta} G}{n \phi_1^{1/\theta}} \sum_{t=1}^{n} \kappa(t) + GR \sqrt{ \frac{(\log n)^{1/\theta}}{n \phi_1^{1/\theta}} \log \frac{(\log n)^{1/\theta}}{\delta} } \right]. \]

The corollary follows from Theorem 2 by taking $\tau = (\log n/(2\phi_1))^{1/\theta}$. When the samples $x_t$ come
from a geometrically $\phi$-mixing process, Corollary 3 yields a high-probability generalization bound
of the same order as that in the i.i.d. setting [7] up to poly-logarithmic factors. Algebraic mixing
gives somewhat slower rates:
Corollary 4. Under the assumptions of Theorem 2, assume further that $\phi(k) \le \phi_0 k^{-\theta}$. Define
$K_n = \sum_{t=1}^{n} \kappa(t)/R$. There exists a finite universal constant C such that with probability at least
$1 - \delta$, for any $w^* \in \mathcal{W}$

\[ f(\hat{w}_n) - f(w^*) \le \frac{1}{n} \mathfrak{R}_n + C \cdot GR \left[ \phi_0^{\frac{1}{1+\theta}} \left( \frac{K_n}{n} \right)^{\frac{\theta}{1+\theta}} + \phi_0^{\frac{1}{2\theta+2}} K_n^{-\frac{1}{2\theta+2}} n^{-\frac{\theta}{2\theta+2}} \sqrt{ \frac{1}{\theta+1} \log \frac{\phi_0 n}{K_n} + \log \frac{1}{\delta} } \right]. \]

The corollary follows by setting $\tau = \phi_0^{1/(\theta+1)} (n/K_n)^{1/(\theta+1)}$. So long as the sum of the stability
constants $\sum_{t=1}^{n} \kappa(t) = o(n)$, the bound in Corollary 4 converges to 0. In addition, we remark that
under the same condition on the stability, an argument similar to that for Corollary 7 of Duchi et
al. [10] implies $f(\hat{w}_n) - f(w^*) \to 0$ almost surely whenever $\phi(k) \to 0$ as $k \to \infty$.

To obtain concrete generalization error rates from our results, one must know bounds on the
stability sequence $\kappa(t)$ (and the regret $\mathfrak{R}_n$). For many online algorithms, including online
gradient and mirror descent [9], the stability sequence satisfies $\kappa(t) \propto 1/\sqrt{t}$. As a more concrete
example, consider Nesterov's dual averaging algorithm [23], which Xiao extends to regularized settings [28].
For convex, G-Lipschitz functions, the dual averaging algorithm satisfies $\mathfrak{R}_n = O(GR\sqrt{n})$, and
with appropriate stepsize choice [28, Lemma 10] proportional to $\sqrt{t}$, one has $\kappa(t) \le R/\sqrt{t}$. Noting
that $\sum_{t=1}^n t^{-1/2} \le 2\sqrt{n}$, substituting the stability bound into the result of Theorem 2 immediately
yields the following: there exists a universal constant C such that with probability at least $1 - \delta$,



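\[
f(\hat{w}_n) - f(w^*) \le \frac{1}{n}\mathfrak{R}_n + C \cdot \inf_{\tau \in \mathbb{N}} \left[ \frac{GR(\tau - 1)}{\sqrt{n}} + GR\sqrt{\frac{\tau}{n}\log\frac{\tau}{\delta}} + \phi(\tau)GR \right]. \tag{15}
\]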



The bound (15) captures the known convergence rates for i.i.d. sequences [7, 23] by taking $\tau = 1$,
since $\phi(1) = 0$ in i.i.d. settings. In addition, specializing to the geometric mixing rate of Corollary 3,

one obtains a generalization error bound of O (J^l + -^j to poly-logarithmic factors. 
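To make the trade-off in the bound (15) concrete, the following minimal sketch evaluates the bracketed expression $GR(\tau-1)/\sqrt{n} + GR\sqrt{(\tau/n)\log(\tau/\delta)} + \phi(\tau)GR$ on a finite grid of $\tau$ and returns the minimizer. The function names, the search cap `tau_max`, and the example mixing coefficients are illustrative assumptions rather than part of the analysis, and the universal constant C is ignored.

```python
import math

def bracket_15(tau, n, delta, G, R, phi):
    """Bracketed term of bound (15) for one value of tau (constant C omitted)."""
    stability = G * R * (tau - 1) / math.sqrt(n)
    deviation = G * R * math.sqrt(tau / n * math.log(tau / delta))
    mixing = phi(tau) * G * R
    return stability + deviation + mixing

def best_tau(n, delta, G, R, phi, tau_max=None):
    """Minimize the bracket over tau in {1, ..., tau_max} by direct search."""
    tau_max = tau_max or n
    return min(range(1, tau_max + 1),
               key=lambda tau: bracket_15(tau, n, delta, G, R, phi))

def geometric_phi(phi0, phi1, theta):
    """Hypothetical geometric profile phi(k) = min(1, phi0 * exp(-phi1 * k**theta))."""
    return lambda k: min(1.0, phi0 * math.exp(-phi1 * k ** theta))

iid = lambda k: 0.0  # phi is identically zero for i.i.d. samples
tau_iid = best_tau(10_000, 0.05, 1.0, 1.0, iid, tau_max=50)
tau_geo = best_tau(10_000, 0.05, 1.0, 1.0, geometric_phi(1.0, 0.5, 1.0), tau_max=50)
```

With $\phi \equiv 0$ every term is nondecreasing in $\tau$, so the search returns $\tau = 1$ and the bracket reduces to $GR\sqrt{\log(1/\delta)/n}$; with a nonzero mixing profile the minimizer moves to a larger $\tau$.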

Theorem 2 and the corollaries following require $\phi$-mixing of the stochastic sequence $x_1, x_2, \ldots$,
which is perhaps an undesirably strong assumption in some situations (for example, when the sample
space $\mathcal{X}$ is unbounded). To mitigate this, we now give high-probability convergence results under
the weaker assumption that the stochastic process P is $\beta$-mixing. These results are (unsurprisingly)
weaker than those for $\phi$-mixing; nonetheless, there is no significant loss in rates of convergence as
long as the process P mixes quickly enough.



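Theorem 3. Under Assumptions A and C, with probability at least $1 - 2\delta$, for any $\tau \in \mathbb{N}$ and for
all $w^* \in \mathcal{W}$ the predictor $\hat{w}_n$ satisfies the guarantee

\[
f(\hat{w}_n) - f(w^*) \le \frac{1}{n}\mathfrak{R}_n + \frac{(\tau - 1)G}{n}\sum_{t=1}^n \kappa(t) + 2GR\sqrt{\frac{2\tau}{n}\log\frac{2\tau}{\delta}} + \frac{2\beta(\tau)GR}{\delta} + \frac{2(\tau - 1)GR}{n}.
\]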



Proof. Following the proof of Theorem 2, we construct the random variables $Z_n$ and $X_t^i$ as in
the definitions (11) and (12). Decomposing $Z_n$ into the two part sum (13), we similarly apply the
Hoeffding-Azuma inequality (as in the proof of Theorem 2) to the first term. The treatment of the
second piece requires more care.

Observe that for any fixed $i, t$, the fact that $w((t-1)\tau + i)$ and $w^*$ are measurable with respect
to $\mathcal{F}_{t-1}^i$ guarantees via Lemma 1 that

\[
E\bigl[\,\bigl|E[X_t^i \mid \mathcal{F}_{t-1}^i]\bigr|\,\bigr] \le GR\beta(\tau).
\]

Applying Markov's inequality, we see that with probability at least $1 - \delta$,

\[
\frac{1}{n}\sum_{i=1}^{\tau}\sum_{t \in \mathcal{I}(i)} E[X_t^i \mid \mathcal{F}_{t-1}^i] \le \frac{GR\beta(\tau)}{\delta}.
\]

Continuing as in the proof of Theorem 2 yields the result of the theorem. □

Though the $1/\delta$ factor in Theorem 3 may be large, we now show that things are not so difficult
as they seem. Indeed, let us now make the additional assumption that the stochastic process
$x_1, x_2, \ldots$ is geometrically $\beta$-mixing. We have the following corollary.

Corollary 5. Under the assumptions of Theorem 3, assume further that $\beta(k) \le \beta_0 \exp(-\beta_1 k^\theta)$.
There exists a finite universal constant C such that with probability at least $1 - 1/n$, for any $w^* \in \mathcal{W}$

f( Wn )-f(w*) < -K n +C- 

n 

The corollary follows from Theorem 3 by setting $\tau = (1.5 \log n/\beta_1)^{1/\theta}$ and a few algebraic
manipulations. Corollary 5 shows that under geometric $\beta$-mixing, we have essentially identical
high-probability generalization guarantees as we had for $\phi$-mixing (cf. Corollary 3), unless the desired
error probability or the mixing constant $\theta$ is extremely small. We can make similar arguments
for polynomially $\beta$-mixing stochastic processes, though the associated weakening of the bound is
somewhat more pronounced.
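The effect of this choice of $\tau$ can be checked numerically. The sketch below assumes $\delta = 1/n$ (an assumption made here to match the $1 - 1/n$ confidence level), rounds $\tau$ up to an integer, and uses hypothetical mixing constants; under these assumptions $\beta(\tau) \le \beta_0 n^{-3/2}$, so the term $\beta(\tau)/\delta$ is at most $\beta_0/\sqrt{n}$.

```python
import math

def block_length(n, beta1, theta, c=1.5):
    """tau = (c * log(n) / beta1) ** (1 / theta), rounded up to an integer."""
    return math.ceil((c * math.log(n) / beta1) ** (1.0 / theta))

def beta_geometric(k, beta0, beta1, theta):
    """Geometric mixing profile beta(k) = beta0 * exp(-beta1 * k**theta)."""
    return beta0 * math.exp(-beta1 * k ** theta)

# Hypothetical constants chosen only for illustration.
n, beta0, beta1, theta = 10_000, 1.0, 0.5, 1.0
tau = block_length(n, beta1, theta)
# beta(tau) / delta with delta = 1/n
mixing_term = beta_geometric(tau, beta0, beta1, theta) * n
```

The block length grows only poly-logarithmically in $n$, which is why the remaining terms of Theorem 3 lose no more than logarithmic factors.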

4 Generalization error bounds for strongly convex functions 

It is by now well-known that the regret of online learning algorithms scales as $O(\log n)$ for strongly
convex functions, results which are due to work of Hazan et al. [12]. To remind the reader, we
recall Assumption B, which states that a function $f$ is $\lambda$-strongly convex with respect to the norm
$\|\cdot\|$ if for all $g \in \partial f(w)$,

\[
f(v) \ge f(w) + \langle g, v - w \rangle + \frac{\lambda}{2}\|w - v\|^2 \quad \text{for } w, v \in \mathcal{W}.
\]

For many online algorithms, including online gradient and mirror descent [3, 12, 24] and dual
averaging [28, Lemma 11], the iterates satisfy the stability bound $\|w(t) - w(t+1)\| \le G/(\lambda t)$ when
the loss functions $F(\cdot; x)$ are $\lambda$-strongly convex. Under these conditions, Corollary 2 gives an expected
generalization error bound of $O(\inf_{\tau \in \mathbb{N}}\{\beta(\tau) + \tau \log n/n\})$ as compared to $O(\inf_{\tau \in \mathbb{N}}\{\beta(\tau) + \sqrt{\tau/n}\})$
for non-strongly convex problems. The improvement in rates, however, does not apply to Theorem 2's
high probability results, since the term controlling the fluctuations around the expectation
of the martingale we construct scales as $O(\sqrt{\tau/n})$. That said, when the samples $x_t$ are drawn i.i.d.
from the distribution $\Pi$, Kakade and Tewari [17] show a generalization error bound of $O(\log n/n)$
with high probability by using self-bounding properties of an appropriately constructed martingale.
In the next theorem, we combine the techniques used to prove our previous results with a
self-bounding martingale argument to derive sharper generalization guarantees when the expected
function $f$ is strongly convex. Throughout this section, we will focus on error to the minimum of
the expected function: $w^* \in \operatorname{argmin}_{w \in \mathcal{W}} f(w)$.



(1.5 log n) l / e G 



l/e 



(1.51ogn)V* , B GR 
- log (^(logn) 1 /^] + 



t=i 



up 1 ' 



n 



24 1 and dual 
Theorem 4. Let Assumptions A, B and C hold, so the expected function $f$ is $\lambda$-strongly convex
with respect to the norm $\|\cdot\|$ over $\mathcal{W}$. Then for any $\delta < 1/e$, $n \ge 3$, with probability at least
$1 - 4\delta \log n$, for any $\tau \in \mathbb{N}$ the predictor $\hat{w}_n$ satisfies

\[
f(\hat{w}_n) - f(w^*) \le \frac{2}{n}\mathfrak{R}_n + \frac{2(\tau - 1)G}{n}\sum_{t=1}^n \kappa(t) + \frac{32G^2\tau}{\lambda n}\log\frac{\tau}{\delta} + \frac{4\tau RG}{n}\Bigl(3\log\frac{\tau}{\delta} + 1\Bigr) + 2RG\phi(\tau).
\]



Before we prove the theorem, we illustrate its use with a simple corollary. We again use Xiao's
extension of Nesterov's dual averaging algorithm [23, 28], where for G-Lipschitz $\lambda$-strongly convex
losses F it is shown that

\[
\|w(t) - w(t+1)\| \le \kappa(t) \le \frac{G}{\lambda t}.
\]

Consequently, Theorem 4 yields the following corollary, applicable to dual averaging, mirror descent,
and online gradient descent:

Corollary 6. In addition to the conditions of Theorem 4, assume the stability bound $\kappa(t) \le G/(\lambda t)$.
There is a universal constant C such that with probability at least $1 - \delta \log n$,

\[
f(\hat{w}_n) - f(w^*) \le \frac{2}{n}\mathfrak{R}_n + C \cdot \inf_{\tau \in \mathbb{N}} \left[ \frac{(\tau - 1)G^2}{\lambda n}\log n + \frac{\tau G^2}{\lambda n}\log\frac{\tau}{\delta} + \frac{G^2}{\lambda}\phi(\tau) \right].
\]



Proof. The proof follows by noting the following two facts: first, $\sum_{t=1}^n \kappa(t) \le (G/\lambda)(1 + \log n)$,
and secondly, the definition (5) of strong convexity implies

\[
G\|w - v\| \ge f(v) - f(w) \ge \langle \nabla f(w), v - w \rangle + \frac{\lambda}{2}\|v - w\|^2.
\]

Recalling that $\|\nabla f(w)\|_* \le G$, we have $\|w - v\| \le 4G/\lambda$ for all $w, v \in \mathcal{W}$, so $R \le 2G/\lambda$. □



We can further extend Corollary 6 using mixing rate assumptions on $\phi$ as in Corollaries 3 and 4,
though this follows the same lines as those. For a few more concrete examples, we note that online
gradient and mirror descent as well as dual averaging [28] all have $\mathfrak{R}_n \le C \cdot (G^2/\lambda) \log n$
when the loss functions $F(\cdot; x)$ are strongly convex (this is stronger than assuming that the expected
function $f$ is strongly convex, but it allows sharp logarithmic bounds on the random quantity $\mathfrak{R}_n$).
In this special case, Corollary 6 implies the generalization bound

\[
f(\hat{w}_n) - f(w^*) = O\left( \inf_{\tau \in \mathbb{N}} \left[ \frac{\tau \log n}{n} + \phi(\tau) \right] \right)
\]

with high probability. For example, online algorithms for SVMs (e.g. [25]) and other regularized
problems satisfy a sharp high-probability generalization guarantee, even for non-i.i.d. data.
We now turn to proving Theorem 4, beginning with a martingale concentration inequality.



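Lemma 7 (Freedman [11], Kakade and Tewari [17]). Let $X_1, \ldots, X_n$ be a martingale difference
sequence adapted to the filtration $\mathcal{F}_t$ with $|X_t| \le b$. Define $V = \sum_{t=1}^n E[X_t^2 \mid \mathcal{F}_{t-1}]$. For any $\delta < 1/e$
and $n \ge 3$,

\[
P\left( \sum_{t=1}^n X_t > \max\Bigl\{ 2\sqrt{V},\ 3b\sqrt{\log(1/\delta)} \Bigr\} \sqrt{\log(1/\delta)} \right) \le 4\delta \log n.
\]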
Proof of Theorem 4. For the proof of this theorem, we do not start from Proposition 2 as
we did for the previous theorems, but begin directly with an appropriate martingale. Recalling the
definition (12) of the random variables $X_t^i$ and the $\sigma$-fields $\mathcal{F}_t^i = \sigma(x_1, \ldots, x_{t\tau+i-1})$ from the proof
of Theorem 2, our goal will be to give sharper concentration results for the martingale difference
sequence $X_t^i - E[X_t^i \mid \mathcal{F}_{t-1}^i]$. To apply Lemma 7, we must bound the variance of the difference
sequence. To that end, note that the conditional variance is bounded as

\begin{align*}
E\bigl[(X_t^i - E[X_t^i \mid \mathcal{F}_{t-1}^i])^2 \mid \mathcal{F}_{t-1}^i\bigr]
&\le E\bigl[(X_t^i)^2 \mid \mathcal{F}_{t-1}^i\bigr] \\
&= E\bigl[\bigl(f(w((t-1)\tau+i)) - f(w^*) - F(w((t-1)\tau+i); x_{t\tau+i-1}) + F(w^*; x_{t\tau+i-1})\bigr)^2 \mid \mathcal{F}_{t-1}^i\bigr] \\
&\le 4G^2 \|w((t-1)\tau+i) - w^*\|^2,
\end{align*}

where in the last line we used the Lipschitz assumption A and the fact that $w((t-1)\tau+i) \in \mathcal{F}_{t-1}^i$.
Of course, since $w^*$ minimizes $f$, the $\lambda$-strong convexity of $f$ implies (see e.g. [15]) that for any
$w \in \mathcal{W}$, $f(w) - f(w^*) \ge \frac{\lambda}{2}\|w - w^*\|^2$. Consequently, we see that

\[
E\bigl[(X_t^i - E[X_t^i \mid \mathcal{F}_{t-1}^i])^2 \mid \mathcal{F}_{t-1}^i\bigr] \le \frac{8G^2}{\lambda}\bigl[f(w((t-1)\tau+i)) - f(w^*)\bigr]. \tag{16}
\]

What remains is to use the single term conditional variance bound (16) to achieve deviation
control over the entire sequence $X_t^i$. To that end, recall the index sets $\mathcal{I}(i)$ defined in the proof of
Theorem 2 and define the summed variance terms $V_i := \sum_{t \in \mathcal{I}(i)} E[(X_t^i - E[X_t^i \mid \mathcal{F}_{t-1}^i])^2 \mid \mathcal{F}_{t-1}^i]$.
Then the bound (16) gives

\[
V_i \le \frac{8G^2}{\lambda}\sum_{t \in \mathcal{I}(i)} \bigl[f(w(\tau(t-1)+i)) - f(w^*)\bigr].
\]

Using the preceding variance bound, we can apply Freedman's concentration result (Lemma 7) to
see that with probability at least $1 - (4\delta \log n)/\tau$,

\[
\sum_{t \in \mathcal{I}(i)} \bigl(X_t^i - E[X_t^i \mid \mathcal{F}_{t-1}^i]\bigr) \le \max\Bigl\{ 2\sqrt{V_i},\ 6GR\sqrt{\log(\tau/\delta)} \Bigr\}\sqrt{\log(\tau/\delta)}. \tag{17}
\]

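We can use the inequality (17) to show concentration. Define the summations

\[
S_i := \sum_{t \in \mathcal{I}(i)} f(w(\tau(t-1)+i)) - f(w^*) \quad \text{and} \quad S_i' := \sum_{t \in \mathcal{I}(i)} F(w(\tau(t-1)+i); x_{\tau t+i-1}) - F(w^*; x_{\tau t+i-1}).
\]

Then the definition (12) of the random variables $X_t^i$ coupled with the inequality (17) implies that

\begin{align*}
S_i &\le S_i' + \max\left\{ \sqrt{\frac{32G^2}{\lambda}S_i},\ 6GR\sqrt{\log\frac{\tau}{\delta}} \right\}\sqrt{\log\frac{\tau}{\delta}} + \sum_{t \in \mathcal{I}(i)} E[X_t^i \mid \mathcal{F}_{t-1}^i] \\
&\le S_i' + \sqrt{\frac{32G^2}{\lambda}\log\frac{\tau}{\delta}}\,\sqrt{S_i} + 6GR\log\frac{\tau}{\delta} + |\mathcal{I}(i)|\,\phi(\tau)RG,
\end{align*}

where we have applied Lemma 1. Solving the induced quadratic in $\sqrt{S_i}$, we see

\[
\sqrt{S_i} \le \sqrt{\frac{8G^2}{\lambda}\log\frac{\tau}{\delta}} + \sqrt{\frac{8G^2}{\lambda}\log\frac{\tau}{\delta} + S_i' + |\mathcal{I}(i)|\,\phi(\tau)RG + 6GR\log\frac{\tau}{\delta}}.
\]

Squaring both sides and using that $(a + b)^2 \le 2a^2 + 2b^2$, we find that

\[
S_i \le \frac{32G^2}{\lambda}\log\frac{\tau}{\delta} + 2S_i' + 12GR\log\frac{\tau}{\delta} + 2|\mathcal{I}(i)|\,\phi(\tau)RG \tag{18}
\]

with probability at least $1 - 4\delta \log n/\tau$.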
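For readers who want the intermediate algebra, the quadratic step can be written out as follows. This is a worked restatement only; the shorthand $x$, $a$ and $c$ is introduced here and does not appear in the original argument.

```latex
\begin{align*}
&\text{Let } x = \sqrt{S_i}, \quad
  a = \sqrt{\tfrac{32G^2}{\lambda}\log\tfrac{\tau}{\delta}}, \quad
  c = S_i' + 6GR\log\tfrac{\tau}{\delta} + |\mathcal{I}(i)|\,\phi(\tau)RG. \\
&x^2 \le a x + c
  \;\Longleftrightarrow\; \bigl(x - \tfrac{a}{2}\bigr)^2 \le \tfrac{a^2}{4} + c
  \;\Longrightarrow\; x \le \tfrac{a}{2} + \sqrt{\tfrac{a^2}{4} + c}, \\
&x^2 \le 2\cdot\tfrac{a^2}{4} + 2\Bigl(\tfrac{a^2}{4} + c\Bigr) = a^2 + 2c
  = \tfrac{32G^2}{\lambda}\log\tfrac{\tau}{\delta} + 2S_i' + 12GR\log\tfrac{\tau}{\delta}
    + 2|\mathcal{I}(i)|\,\phi(\tau)RG.
\end{align*}
```

Here $a/2 = \sqrt{(8G^2/\lambda)\log(\tau/\delta)}$, which is the first square root in the solved quadratic.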

We have now nearly completed the proof of the theorem. Our first step for the remainder is to
note that

\[
\sum_{i=1}^{\tau} S_i = \sum_{t=1}^n f(w(t)) - f(w^*).
\]

Applying a union bound, we use the inequality (18) to see that with probability at least $1 - 4\delta \log n$,

\[
\sum_{t=1}^n f(w(t)) - f(w^*) \le 2\sum_{i=1}^{\tau} S_i' + \frac{32G^2\tau}{\lambda}\log\frac{\tau}{\delta} + 12\tau GR\log\frac{\tau}{\delta} + 2n\phi(\tau)RG.
\]

All that remains is to use stability to relate the sum $\sum_{i=1}^{\tau} S_i'$ to the regret $\mathfrak{R}_n$, which is similar to
what we did in the proof of Proposition 2. Indeed, by the definition of the sums $S_i'$ we have

\begin{align*}
\sum_{i=1}^{\tau} S_i' &= \sum_{t=1}^n F(w(t); x_{t+\tau-1}) - F(w^*; x_{t+\tau-1}) \\
&= \sum_{t=1}^n F(w(t); x_t) - F(w^*; x_t) + \sum_{t=1}^{n-\tau+1} F(w(t); x_{t+\tau-1}) - F(w(t+\tau-1); x_{t+\tau-1}) \\
&\quad + \sum_{t=1}^{\tau-1} F(w^*; x_t) - \sum_{t=n+1}^{n+\tau-1} F(w^*; x_t) + \sum_{t=n-\tau+2}^{n} F(w(t); x_{t+\tau-1}) - \sum_{t=1}^{\tau-1} F(w(t); x_t) \\
&\le \mathfrak{R}_n + 2(\tau - 1)GR + (\tau - 1)G\sum_{t=1}^n \kappa(t), \tag{19}
\end{align*}

where the inequality follows from the definition of the regret, the boundedness assumption A
and the stability assumption C. Applying the final bound, we see that

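\[
\sum_{t=1}^n f(w(t)) - f(w^*) \le 2\mathfrak{R}_n + 2(\tau - 1)G\sum_{t=1}^n \kappa(t) + \frac{32G^2\tau}{\lambda}\log\frac{\tau}{\delta} + 12\tau GR\log\frac{\tau}{\delta} + 2n\phi(\tau)RG + 4(\tau - 1)RG
\]

with probability at least $1 - 4\delta \log n$. Dividing by $n$ and applying Jensen's inequality completes the
proof. □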

We now turn to the case of $\beta$-mixing. As before, the proof largely follows the proof of the
$\phi$-mixing case, with a suitable application of Markov's inequality being the only difference.

Theorem 5. In addition to Assumptions A and C, assume further that the expected function $f$
is $\lambda$-strongly convex with respect to the norm $\|\cdot\|$ over $\mathcal{W}$. Then for any $\delta < 1/e$, $n \ge 3$, with
probability greater than $1 - 5\delta \log n$, for any $\tau \in \mathbb{N}$ the predictor $\hat{w}_n$ satisfies

, , 2 2(r - 1)G , . \ 32G 2 t r YItRG , 2r 2RG/3(t) 

n n I \n o n o 5 

Proof. We closely follow the proof of Theorem 4. Through the bound (17), no step in the proof of
Theorem 4 uses $\phi$-mixing. The use of $\phi$-mixing occurs in bounding terms of the form $E[X_t^i \mid \mathcal{F}_{t-1}^i]$.
Rather than bounding them immediately (as was done following Eq. (17) in the proof of Theorem 4),
we carry them further through the steps of the proof. Using the notation of Theorem 4's proof, in
place of the inequality (18), we have

\[
S_i \le \frac{32G^2}{\lambda}\log\frac{\tau}{\delta} + 2S_i' + 12GR\log\frac{\tau}{\delta} + 2\sum_{t \in \mathcal{I}(i)} E[X_t^i \mid \mathcal{F}_{t-1}^i]
\]

with probability at least $1 - 4\delta \log n/\tau$. Paralleling the proof of Theorem 4, we find that with
probability at least $1 - 4\delta \log n$,

\begin{align*}
\sum_{t=1}^n f(w(t)) - f(w^*)
&\le 2\mathfrak{R}_n + 2(\tau - 1)G\sum_{t=1}^n \kappa(t) + \frac{32G^2\tau}{\lambda}\log\frac{\tau}{\delta} + 12\tau GR\log\frac{\tau}{\delta} \\
&\quad + 4(\tau - 1)RG + 2\sum_{i=1}^{\tau}\sum_{t \in \mathcal{I}(i)} E[X_t^i \mid \mathcal{F}_{t-1}^i]. \tag{20}
\end{align*}

As in the proof of Theorem 3, we apply Markov's inequality to the final term, which gives with
probability at least $1 - \delta$

\[
\sum_{i=1}^{\tau}\sum_{t \in \mathcal{I}(i)} E[X_t^i \mid \mathcal{F}_{t-1}^i] \le \frac{nGR\beta(\tau)}{\delta}.
\]

Substituting this bound into the inequality (20) and applying a union bound (noting that $\delta <
\delta \log n$) completes the proof. □



As was the case for Theorem 3, when the process $x_1, x_2, \ldots$ is geometrically $\beta$-mixing, we can
obtain a corollary of the above result showing no essential loss of rates with respect to geometrically
$\phi$-mixing processes. We omit details as the technique is basically identical to that for Corollary 5.

5 Linear Prediction 

For this section, we place ourselves in the common statistical prediction setting where the statistical
samples come in pairs of the form $(x, y) \in \mathcal{X} \times \mathcal{Y}$, where $y$ is the label or target value of the sample
$x$, and the samples are finite dimensional: $\mathcal{X} \subset \mathbb{R}^d$. Now we measure the goodness of the hypothesis
$w$ on the example $(x, y)$ by

\[
F(w; (x, y)) = \ell(y, \langle x, w \rangle), \qquad \ell : \mathcal{Y} \times \mathbb{R} \to \mathbb{R}, \tag{21}
\]

where the loss function $\ell$ measures the accuracy of the prediction $\langle x, w \rangle$. An extraordinary
number of statistical learning problems fall into the above framework: linear regression, where
the loss is of the form $\ell(y, \langle x, w \rangle) = \frac{1}{2}(y - \langle x, w \rangle)^2$; logistic regression, where $\ell(y, \langle x, w \rangle) =
\log(1 + \exp(-y \langle x, w \rangle))$; boosting and SVMs all have the form (21).
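As a small illustration of the form (21), the sketch below implements the two losses named above as scalar functions of the prediction $\langle x, w \rangle$. The function names are assumptions made for this example; the logistic loss is written in a numerically stable way, which is an implementation choice rather than anything specified in the text.

```python
import math

def squared_loss(y, p):
    """Least-squares loss 0.5 * (y - p)^2 as a function of the prediction p."""
    return 0.5 * (y - p) ** 2

def logistic_loss(y, p):
    """Logistic loss log(1 + exp(-y * p)), evaluated without overflow."""
    m = -y * p
    return max(m, 0.0) + math.log1p(math.exp(-abs(m)))

def linear_prediction_loss(loss, w, x, y):
    """F(w; (x, y)) = loss(y, <x, w>), the form (21)."""
    p = sum(wi * xi for wi, xi in zip(w, x))
    return loss(y, p)

w = [1.0, 2.0]
x = [3.0, -1.0]
v = [1.0, 3.0]                                  # <x, v> = 0
w_shifted = [wi + vi for wi, vi in zip(w, v)]   # same prediction as w on x
```

Because the loss depends on $w$ only through $\langle x, w \rangle$, moving $w$ along any direction orthogonal to $x$ (here `v`) leaves the loss on that sample unchanged.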

The loss function (21) makes it clear that individual samples cannot be strongly convex, since
the linear operator $\langle x, \cdot \rangle$ has a nontrivial null space. However, in many problems, the expected
loss function $f(w) := E_\Pi[F(w; (x, y))]$ is strongly convex even though individual loss functions
$F(w; (x, y))$ are not. To quantify this, we now assume that $\|x\|_2 \le r$ for $\Pi$-a.e. $x \in \mathcal{X}$, and make
the following assumption on the loss:

Assumption D (Linear strong convexity). For fixed $y$, the loss function $\ell(y, \cdot)$ is a $\lambda$-strongly
convex and L-Lipschitz scalar function over $[-Rr, Rr]$:

\[
\ell(y, b) \ge \ell(y, a) + \ell'(y, a)(b - a) + \frac{\lambda}{2}(b - a)^2 \quad \text{and} \quad |\ell(y, b) - \ell(y, a)| \le L|a - b|
\]

for any $a, b \in \mathbb{R}$ with $\max\{|a|, |b|\} \le Rr$.

Our choice of $Rr$ above is intentional, since $\langle x, w \rangle \le Rr$ by Hölder's inequality and our
compactness assumption. A few examples of such loss functions include logistic regression and
least-squares regression, the latter of which satisfies Assumption D with $\lambda = 1$. To see that the
expected loss function satisfying Assumption D is strongly convex, note that²



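\begin{align*}
f(v) &= E_\Pi[\ell(y, \langle x, v \rangle)] \\
&\ge E_\Pi\Bigl[\ell(y, \langle x, w \rangle) + \ell'(y, \langle x, w \rangle)(\langle x, v \rangle - \langle x, w \rangle) + \frac{\lambda}{2}(\langle x, v \rangle - \langle x, w \rangle)^2\Bigr] \\
&= E_\Pi\bigl[F(w; (x, y)) + \langle \nabla F(w; (x, y)), v - w \rangle\bigr] + \frac{\lambda}{2}E_\Pi\bigl[\langle x, v \rangle^2 + \langle x, w \rangle^2 - 2\langle x, w \rangle\langle x, v \rangle\bigr] \\
&= f(w) + \langle \nabla f(w), v - w \rangle + \frac{\lambda}{2}\langle \mathrm{Cov}(x)(w - v), w - v \rangle, \tag{22}
\end{align*}

where $\mathrm{Cov}(x)$ is the covariance matrix of $x$ under the stationary distribution $\Pi$. So as long as
$\lambda_{\min}(\mathrm{Cov}(x)) > 0$, we see that the expected function $f$ is $\lambda \cdot \lambda_{\min}(\mathrm{Cov}(x))$-strongly convex.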
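The chain (22) can be checked numerically for least squares, where $\lambda = 1$ and the inequality holds with equality. The sketch below is illustrative only: the four-point data set stands in for the stationary distribution (an assumption of this example), and the matrix formed is the second-moment matrix $E[xx^\top]$, which is the quantity the quadratic term in (22) produces.

```python
import math

# Hypothetical finite sample playing the role of the stationary distribution.
DATA = [((1.0, 0.0), 0.5), ((0.0, 2.0), -1.0), ((1.0, 1.0), 2.0), ((-1.0, 0.5), 0.0)]

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def f(w):
    """Expected least-squares loss under the uniform distribution on DATA."""
    return sum(0.5 * (y - dot(x, w)) ** 2 for x, y in DATA) / len(DATA)

def grad_f(w):
    """Gradient of f: the average of (<x, w> - y) * x."""
    g = [0.0, 0.0]
    for x, y in DATA:
        r = dot(x, w) - y
        g = [gi + r * xi / len(DATA) for gi, xi in zip(g, x)]
    return g

def second_moment():
    """Average of the outer products x x^T over DATA."""
    m = [[0.0, 0.0], [0.0, 0.0]]
    for x, _ in DATA:
        for i in range(2):
            for j in range(2):
                m[i][j] += x[i] * x[j] / len(DATA)
    return m

def lambda_min_2x2(m):
    """Smallest eigenvalue of a symmetric 2x2 matrix in closed form."""
    mean = 0.5 * (m[0][0] + m[1][1])
    gap = math.sqrt((0.5 * (m[0][0] - m[1][1])) ** 2 + m[0][1] ** 2)
    return mean - gap

w, v = (0.3, -0.2), (-1.0, 0.7)
d = [vi - wi for vi, wi in zip(v, w)]
M = second_moment()
quad = sum(d[i] * M[i][j] * d[j] for i in range(2) for j in range(2))
gap_in_22 = f(v) - f(w) - dot(grad_f(w), d)   # should equal 0.5 * quad
```

The difference `gap_in_22` matches the quadratic form, and it is bounded below by half the smallest eigenvalue times $\|v - w\|^2$, which is the strong convexity statement following (22).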

If we had access to a stable online learning algorithm with small (i.e. logarithmic) regret for
losses of the form (21) satisfying Assumption D, we could simply apply Theorem 4 and guarantee
good generalization properties of the predictor $\hat{w}_n$ the algorithm outputs. The theorem assumes
only strong convexity of the expected function $f$, which (as per our above discussion) is the case
for linear prediction, so the sharp generalization guarantee would follow from the inequality (22).
However, we found it difficult to show that existing algorithms satisfy our desiderata of logarithmic
regret and stability, both of which are crucial requirements for our results. Below, we present
a slight modification of Hazan et al.'s follow the approximate leader (FTAL) algorithm [12] to
achieve the desired results. Our approach is to essentially combine FTAL with the Vovk-Azoury-Warmuth
forecaster [8, Chapter 11.8], where the algorithm uses the sample $x_t$ to make its prediction.
Specifically, our algorithm is as follows. At iteration $t$ of the algorithm, the algorithm receives $x_t$,
plays the point $w(t)$, suffers loss $F(w(t); (x_t, y_t))$, then adds $\nabla F(w(t); (x_t, y_t))$ to its collection of
observed (sub)gradients. The algorithm's calculation of $w(t)$ at iteration $t$ is

\[
w(t) = \operatorname*{argmin}_{w \in \mathcal{W}} \left\{ \sum_{i=1}^{t-1} \langle \nabla F(w(i); (x_i, y_i)), w \rangle + \frac{\lambda}{2}\sum_{i=1}^{t-1} \langle w(i) - w, x_i \rangle^2 + \frac{\lambda}{2} w^\top (x_t x_t^\top + \epsilon I) w \right\}. \tag{23}
\]

² For notational convenience we use $\nabla F$ to denote either the gradient or a measurable selection from the subgradient
set $\partial F$; this is no loss of generality.



The algorithm above is quite similar to Hazan et al.'s FTAL algorithm [12], and the following
proposition shows that the algorithm (23) does in fact have logarithmic regret (we give a proof of
the proposition, which is somewhat technical, in Appendix A).

Proposition 3. Let the sequence $w(t)$ be defined by the update (23) under Assumption D. Then
for any $\epsilon > 0$ and any sequence of samples $(x_t, y_t)$,

\[
\sum_{t=1}^n F(w(t); (x_t, y_t)) - F(w^*; (x_t, y_t)) \le \frac{9L^2 d}{2\lambda}\log\left(\frac{r^2 n}{\epsilon} + 1\right) + \frac{\lambda\epsilon}{2}\|w^*\|_2^2.
\]



What remains is to show that a suitable form of stability holds for the algorithm (23) that
we have defined. The additional stability provided by using $x_t$ in the update of $w(t)$ appears to
be important. In the original version [12] of the FTAL algorithm, the predictor $w(t)$ can change
quite drastically if a sample $x_t$ sufficiently different from the past (in the sense that $\langle x_{t'}, x_t \rangle \approx 0$
for $t' < t$) is encountered. In the presence of dependence between samples, such large updates
can be detrimental to performance, since they keep the algorithm from exploiting the mixing of
the stochastic process. Returning to our argument on stability, we recall the proof of Theorem 4,
specifically the argument leading to the bound (19). We see that the stability bound does not
require the full power of Assumption C, but in fact it is sufficient that

\[
F(w(t); (x_{t+\tau}, y_{t+\tau})) - F(w(t+\tau); (x_{t+\tau}, y_{t+\tau})) \le \tau\kappa(t),
\]

that is, the differences in loss values are stable. To quantify the stability of the algorithm (23),
we require two definitions that will be useful here and in our subsequent proofs. Define the outer
product matrices

\[
A_t := \sum_{i=1}^t x_i x_i^\top \quad \text{and} \quad A_{t,\epsilon} := A_t + \epsilon I. \tag{24}
\]

Given a positive definite matrix $A$, the associated Mahalanobis norm and its dual are defined as

\[
\|w\|_A^2 := \langle Aw, w \rangle \quad \text{and} \quad \|w\|_{A^{-1}}^2 := \langle A^{-1}w, w \rangle.
\]

Then the following proposition (whose proof we provide in Appendix A) shows that stability holds
for the linear-prediction algorithm (23).



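Proposition 4. Let $w(t)$ be generated according to the update (23) and let Assumption D hold.
Then for any $\tau \in \mathbb{N}$,

\begin{align*}
&F(w(t); (x_{t+\tau}, y_{t+\tau})) - F(w(t+\tau); (x_{t+\tau}, y_{t+\tau})) \\
&\qquad \le \frac{L^2}{2\lambda}\left( 7\tau\|x_{t+\tau}\|_{A_{t+\tau,\epsilon}^{-1}}^2 + 5\sum_{s=1}^{\tau-1}\|x_{t+s}\|_{A_{t+s,\epsilon}^{-1}}^2 + 3\|x_t\|_{A_{t,\epsilon}^{-1}}^2 \right).
\end{align*}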
We use one more observation to derive a generalization bound for the approximate follow-the-leader
update (23). For any loss $\ell$ satisfying Assumption D, standard convex analysis gives that
$|\ell'(y, a)| \le L$, so by straightforward algebra (taking $a = -Rr$ and $b = Rr$),

\[
2L|a - b| \ge \frac{\lambda}{2}(b - a)^2, \quad \text{implying} \quad \lambda \le \frac{2L}{Rr}. \tag{25}
\]
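The "straightforward algebra" behind (25) can be spelled out in two lines. This is a worked restatement using only Assumption D and the bound $|\ell'(y, a)| \le L$.

```latex
\begin{align*}
\tfrac{\lambda}{2}(b-a)^2
  &\le \ell(y,b) - \ell(y,a) - \ell'(y,a)(b-a)
   \le L|b-a| + L|b-a| = 2L|a-b|, \\
\text{so with } a = -Rr,\ b = Rr: \quad
\tfrac{\lambda}{2}(2Rr)^2 &\le 2L \cdot 2Rr
  \;\Longrightarrow\; \lambda \le \frac{2L}{Rr}.
\end{align*}
```

Equivalently, $\lambda Rr \le 2L$, which is the form in which the bound is applied in the proofs.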

Now, using Proposition 4 and the regret bound from Proposition 3, we give a fast high-probability
convergence guarantee for online algorithms applied to linear prediction problems, such
as linear or logistic regression, satisfying Assumption D. Specifically,

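Theorem 6. Let $w(t)$ be generated according to the update (23) with $\epsilon = 1$. Then with probability
at least $1 - 4\delta \log n$, for any $\tau \in \mathbb{N}$,

\begin{align*}
f(\hat{w}_n) - f(w^*) &\le \frac{L^2 d}{\lambda n}(9 + 14\tau)\log(r^2 n + 1) + \frac{\lambda}{n}\|w^*\|_2^2 + \frac{32L^2 r^2 \tau}{\lambda n \cdot \lambda_{\min}(\mathrm{Cov}(x))}\log\frac{\tau}{\delta} \\
&\quad + \frac{8\tau L^2}{\lambda n}\Bigl(3\log\frac{\tau}{\delta} + 1\Bigr) + \frac{4L^2}{\lambda}\phi(\tau).
\end{align*}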

Proof. Given the regret bound in Proposition 3, all that remains is to control the stability of the
algorithm. To that end, note that

\[
\sum_{t=1}^{n-\tau} F(w(t); (x_{t+\tau}, y_{t+\tau})) - F(w(t+\tau); (x_{t+\tau}, y_{t+\tau})) \le \frac{7L^2\tau}{\lambda}\sum_{t=1}^n \|x_t\|_{A_{t,\epsilon}^{-1}}^2 \le \frac{7L^2\tau d}{\lambda}\log\left(\frac{r^2 n}{\epsilon} + 1\right), \tag{26}
\]

the last inequality following from an application of Hazan et al.'s Lemma 11 [12]. Further, using
Assumption D, we know that the Lipschitz constant of $F$ is $G \le Lr$. We mimic the proof of
Theorem 4 for the remainder of the argument. This requires a minor redefinition of our martingale
sequence, since $w(t)$ depends on $x_t$ in the update (23), whereas our previous proofs required $w(t)$
to be measurable with respect to $\mathcal{F}_{t-1}$. As a result, we now define

\[
X_t^i := f(w((t-1)\tau + i)) - f(w^*) + F(w^*; x_{t\tau+i}) - F(w((t-1)\tau + i); x_{t\tau+i}),
\]

and the associated $\sigma$-fields $\mathcal{F}_t^i := \mathcal{F}_{t\tau+i} = \sigma(x_1, \ldots, x_{t\tau+i})$. The sequence $X_t^i - E[X_t^i \mid \mathcal{F}_{t-1}^i]$ defines
a martingale difference sequence adapted to the filtration $\mathcal{F}_t^i$, $t = 1, 2, \ldots$. The remainder of the
proof parallels that of Theorem 4, with the modification that terms involving $(\tau - 1)G$ are replaced
by terms involving $\tau G$. Specifically, we use the inequality (19), the regret bound from Proposition 3,
and the stability guarantee (26) to see

\begin{align*}
f(\hat{w}_n) - f(w^*) &\le \frac{L^2 d}{\lambda n}(9 + 14\tau)\log\left(\frac{r^2 n}{\epsilon} + 1\right) + \frac{\lambda\epsilon}{n}\|w^*\|_2^2 + \frac{32L^2 r^2 \tau}{\lambda n \cdot \lambda_{\min}(\mathrm{Cov}(x))}\log\frac{\tau}{\delta} \\
&\quad + \frac{4\tau LRr}{n}\Bigl(3\log\frac{\tau}{\delta} + 1\Bigr) + 2LRr\phi(\tau).
\end{align*}

Noting that $Rr \le 2L/\lambda$ by the bound (25) completes the proof. □
To simplify the conclusions of Theorem 6, we can ignore constants and the size of the sample
space $\mathcal{X}$. Doing this, we see that with probability at least $1 - \delta$,

\[
f(\hat{w}_n) - f(w^*) \le O(1) \cdot \inf_{\tau \in \mathbb{N}} \left[ \frac{L^2 d\tau}{\lambda n}\log n + \frac{L^2\tau}{\lambda n \cdot \lambda_{\min}(\mathrm{Cov}(x))}\log\frac{\tau}{\delta} + \frac{L^2}{\lambda}\phi(\tau) \right].
\]



In particular, we can specialize this result in the face of different mixing assumptions on the process.
We give the bound only for geometrically mixing processes, that is, when $\phi(k) \le \phi_0 \exp(-\phi_1 k^\theta)$.
Then we have (as in Corollary 3) the following:

Corollary 8. Let $w(t)$ be generated according to the follow-the-approximate-leader update (23) and
assume that the process P is geometrically $\phi$-mixing. Then with probability at least $1 - \delta$,



f(Wn) ~ f(w*) < 0(1) 



L 2 d(log n 
4>V 8 An 



,1+71 



+ 



L 2 (logn) 



0i /9 An • A min (Cov(x)) 



log 



log n 



We conclude this section by noting without proof that, since all the results here build on the
theorems of Section 4, it is possible to analogously derive corresponding high-probability convergence
guarantees when the stochastic process P is $\beta$-mixing rather than $\phi$-mixing. In this case, we
build on Theorem 5 rather than Theorem 4, but the techniques are largely identical.



6 Conclusions 

In this paper, we have shown how to obtain high-probability data-dependent bounds on the gen- 
eralization error, or excess risk, of hypotheses output by online learning algorithms, even when 
samples are dependent. In doing so, we have extended several known results on the generalization 
properties of online algorithms with independent data. By using martingale tools, we have given 
(we hope) direct simple proofs of convergence guarantees for learning algorithms with dependent 
data without requiring the machinery of empirical process theory. In addition, the results in this 
paper may be of independent interest for stochastic optimization, since they show both the ex- 
pected and high-probability convergence of any low-regret stable online algorithm for stochastic 
approximation problems, even with dependent samples. 

We believe there are a few natural open questions this work raises. First, can online algorithms
guarantee good generalization performance when the underlying stochastic process is only
$\alpha$-mixing? Our techniques do not seem to extend readily to this more general setting, as it is less
natural for measuring convergence of conditional distributions, so we suspect that a different or
more careful approach will be necessary. Our second question regards adaptivity: can an online
algorithm be more intimately coupled with the data and automatically adapt to the dependence of
the sequence of statistical samples $x_1, x_2, \ldots$? This might allow both stronger regret bounds and
better rates of convergence than we have achieved.
A Technical Proofs

Proof of Proposition 3. We first give an equivalent form of the algorithm (23) for which it
is a bit simpler to prove results (though the form is less intuitive). Define the (sub)gradient-like
vectors $g(t)$ for all $t$ as

\[
g(t) := \nabla F(w(t); (x_t, y_t)) - \lambda x_t x_t^\top w(t). \tag{27}
\]

Then a bit of algebra shows that the algorithm (23) is equivalent to

\[
w(t) = \operatorname*{argmin}_{w \in \mathcal{W}} \left\{ \sum_{i=1}^{t-1} \langle g(i), w \rangle + \frac{\lambda}{2}\langle A_{t,\epsilon} w, w \rangle \right\}. \tag{28}
\]
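The equivalence of (23) and (28) can be checked numerically. The sketch below is a simplified illustration, not the algorithm as analyzed: it assumes the unconstrained case $\mathcal{W} = \mathbb{R}^d$ (so the minimizer of (28) solves $\lambda A_{t,\epsilon} w = -z(t-1)$ with $z(t-1) = \sum_{i<t} g(i)$), uses the least-squares loss, and all function names and sample values are hypothetical. It then verifies that each iterate also zeroes the gradient of the objective in (23).

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def solve(A, b):
    """Solve A u = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            k = M[r][c] / M[c][c]
            M[r] = [mr - k * mc for mr, mc in zip(M[r], M[c])]
    u = [0.0] * n
    for r in reversed(range(n)):
        u[r] = (M[r][n] - sum(M[r][j] * u[j] for j in range(r + 1, n))) / M[r][r]
    return u

def ftal_unconstrained(samples, dloss, lam, eps):
    """Iterates of update (28) assuming W = R^d: lam * A_{t,eps} w(t) = -z(t-1)."""
    d = len(samples[0][0])
    A = [[eps if i == j else 0.0 for j in range(d)] for i in range(d)]  # eps * I
    z = [0.0] * d                              # running sum of g(i), i < t
    iterates, grads = [], []
    for x, y in samples:
        for i in range(d):                     # A_{t,eps} already includes x_t x_t^T
            for j in range(d):
                A[i][j] += x[i] * x[j]
        w = solve(A, [-zi / lam for zi in z])
        pred = dot(x, w)
        alpha = dloss(y, pred)                 # scalar derivative of the loss at <x, w>
        grad = [alpha * xi for xi in x]        # gradient of F(w; (x, y))
        z = [zi + gi - lam * xi * pred for zi, gi, xi in zip(z, grad, x)]  # add g(t), eq. (27)
        iterates.append(w)
        grads.append(grad)
    return iterates, grads

def grad_of_23(w, t, samples, iterates, grads, lam, eps):
    """Gradient at w of the objective in update (23) for round t (0-indexed)."""
    x_t = samples[t][0]
    g = [lam * (xi * dot(x_t, w) + eps * wi) for xi, wi in zip(x_t, w)]
    for i in range(t):
        x_i = samples[i][0]
        c = lam * (dot(x_i, w) - dot(x_i, iterates[i]))
        g = [gj + grads[i][j] + c * x_i[j] for j, gj in enumerate(g)]
    return g

samples = [((1.0, 0.0), 1.0), ((0.5, 1.0), -0.5), ((-1.0, 2.0), 0.3), ((0.2, 0.1), 2.0)]
sq_dloss = lambda y, p: p - y                  # derivative of 0.5 * (y - p)^2 in p
ws, gs = ftal_unconstrained(samples, sq_dloss, lam=1.0, eps=1.0)
residuals = [grad_of_23(ws[t], t, samples, ws, gs, 1.0, 1.0) for t in range(len(samples))]
```

Each entry of `residuals` is zero up to floating-point error, which is the first-order optimality condition of (23) evaluated at the solution of (28). With a compact constraint set $\mathcal{W}$, a projection step in the norm induced by $A_{t,\epsilon}$ would be needed instead of the plain linear solve.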



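We now turn to the proof of the regret bound in the proposition. Our proof is similar to the proofs
of related results of Nesterov [23] and Xiao [28]. We begin by noting that via Assumption D,

\begin{align*}
&\sum_{t=1}^n F(w(t); (x_t, y_t)) - F(w^*; (x_t, y_t)) \\
&\quad \le \sum_{t=1}^n \langle \nabla F(w(t); (x_t, y_t)), w(t) - w^* \rangle - \frac{\lambda}{2}\sum_{t=1}^n (w(t) - w^*)^\top x_t x_t^\top (w(t) - w^*) \\
&\quad = \sum_{t=1}^n \langle \nabla F(w(t); (x_t, y_t)) - \lambda x_t x_t^\top w(t), w(t) - w^* \rangle + \frac{\lambda}{2}\sum_{t=1}^n \langle x_t x_t^\top w(t), w(t) \rangle - \frac{\lambda}{2}\sum_{t=1}^n \langle x_t x_t^\top w^*, w^* \rangle \\
&\quad = \sum_{t=1}^n \langle g(t), w(t) - w^* \rangle + \frac{\lambda}{2}\sum_{t=1}^n \langle x_t, w(t) \rangle^2 - \frac{\lambda}{2}\langle A_n w^*, w^* \rangle. \tag{29}
\end{align*}

Define the proximal function $\psi_t(w) = \frac{\lambda}{2}\langle A_{t,\epsilon} w, w \rangle$ and let $z(t) = \sum_{i=1}^t g(i)$. Then we can bound
the regret (29) by taking a supremum and introducing the conjugate to $\psi_n$, defined by $\psi_n^*(z) =
\sup_{w \in \mathcal{W}}\{\langle z, w \rangle - \psi_n(w)\}$. In particular, we see that for any $\epsilon > 0$

\begin{align*}
&\sum_{t=1}^n F(w(t); (x_t, y_t)) - F(w^*; (x_t, y_t)) \\
&\quad \le \sum_{t=1}^n \langle g(t), w(t) \rangle + \frac{\lambda}{2}\sum_{t=1}^n \langle x_t, w(t) \rangle^2 + \sup_{w \in \mathcal{W}}\Bigl[ -\langle z(n), w \rangle - \frac{\lambda}{2}\langle A_n w, w \rangle - \frac{\lambda\epsilon}{2}\|w\|_2^2 \Bigr] + \frac{\lambda\epsilon}{2}\|w^*\|_2^2 \\
&\quad = \sum_{t=1}^n \langle g(t), w(t) \rangle + \frac{\lambda}{2}\sum_{t=1}^n \langle x_t, w(t) \rangle^2 + \psi_n^*(-z(n)) + \frac{\lambda\epsilon}{2}\|w^*\|_2^2. \tag{30}
\end{align*}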

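The function $\psi_n^*$ has $(1/\lambda)$-Lipschitz continuous gradient with respect to the Mahalanobis norm
induced by $A_{n,\epsilon}$ (e.g. [15], [23]), and further it is known that $\nabla\psi_n^*(z) = \operatorname{argmin}_{w \in \mathcal{W}}\{\langle -z, w \rangle + \psi_n(w)\}$,
so that $\nabla\psi_n^*(-z(n-1)) = w(n)$ by definition of the update (23). Thus we see

\begin{align*}
\psi_n^*(-z(n)) &\le \psi_n^*(-z(n-1)) + \langle \nabla\psi_n^*(-z(n-1)), z(n-1) - z(n) \rangle + \frac{1}{2\lambda}\|z(n) - z(n-1)\|_{A_{n,\epsilon}^{-1}}^2 \\
&= \psi_n^*(-z(n-1)) - \langle w(n), g(n) \rangle + \frac{1}{2\lambda}\|g(n)\|_{A_{n,\epsilon}^{-1}}^2 \\
&= -\langle z(n-1), w(n) \rangle - \frac{\lambda}{2}\langle A_{n,\epsilon} w(n), w(n) \rangle - \langle w(n), g(n) \rangle + \frac{1}{2\lambda}\|g(n)\|_{A_{n,\epsilon}^{-1}}^2,
\end{align*}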
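since $w(n)$ minimizes $\langle z(n-1), w \rangle + \psi_n(w)$. Plugging the last inequality into the bound (30) yields

\begin{align*}
&\sum_{t=1}^n F(w(t); (x_t, y_t)) - F(w^*; (x_t, y_t)) \\
&\quad \le \sum_{t=1}^n \langle g(t), w(t) \rangle + \frac{\lambda}{2}\sum_{t=1}^n \langle x_t, w(t) \rangle^2 - \langle z(n-1), w(n) \rangle - \frac{\lambda}{2}\langle A_{n,\epsilon} w(n), w(n) \rangle - \langle w(n), g(n) \rangle \\
&\qquad + \frac{\lambda\epsilon}{2}\|w^*\|_2^2 + \frac{1}{2\lambda}\|g(n)\|_{A_{n,\epsilon}^{-1}}^2 \\
&\quad = \sum_{t=1}^{n-1} \langle g(t), w(t) \rangle + \frac{\lambda}{2}\sum_{t=1}^{n-1} \langle x_t, w(t) \rangle^2 - \langle z(n-1), w(n) \rangle - \frac{\lambda}{2}\langle A_{n-1,\epsilon} w(n), w(n) \rangle \\
&\qquad + \frac{\lambda\epsilon}{2}\|w^*\|_2^2 + \frac{1}{2\lambda}\|g(n)\|_{A_{n,\epsilon}^{-1}}^2 \\
&\quad \le \sum_{t=1}^{n-1} \langle g(t), w(t) \rangle + \frac{\lambda}{2}\sum_{t=1}^{n-1} \langle x_t, w(t) \rangle^2 + \psi_{n-1}^*(-z(n-1)) + \frac{\lambda\epsilon}{2}\|w^*\|_2^2 + \frac{1}{2\lambda}\|g(n)\|_{A_{n,\epsilon}^{-1}}^2,
\end{align*}

since $A_n = A_{n-1} + x_n x_n^\top$. Repeating the argument inductively down from $n - 1$, we find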

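\[
\sum_{t=1}^n F(w(t); (x_t, y_t)) - F(w^*; (x_t, y_t)) \le \frac{1}{2\lambda}\sum_{t=1}^n \|g(t)\|_{A_{t,\epsilon}^{-1}}^2 + \frac{\lambda\epsilon}{2}\|w^*\|_2^2. \tag{31}
\]

The bound (31) nearly completes the proof of the proposition, but we must control the gradient
norm $\|g(t)\|_{A_{t,\epsilon}^{-1}}$ terms. To that end, let $\alpha_t = \ell'(y_t, \langle x_t, w(t) \rangle) \in \mathbb{R}$ and note that

\[
\|g(t)\|_{A_{t,\epsilon}^{-1}}^2 = \bigl\langle A_{t,\epsilon}^{-1}(\alpha_t x_t - \lambda x_t x_t^\top w(t)), \alpha_t x_t - \lambda x_t x_t^\top w(t) \bigr\rangle \le (L + \lambda Rr)^2 \|x_t\|_{A_{t,\epsilon}^{-1}}^2,
\]

since by Assumption D, $|\alpha_t| \le L$. Now we apply a result of Hazan et al. [12, Lemma 11], giving

\[
\sum_{t=1}^n \|g(t)\|_{A_{t,\epsilon}^{-1}}^2 \le (L + \lambda Rr)^2 d \log\left(\frac{r^2 n}{\epsilon} + 1\right).
\]

Using that $\lambda \le 2L/(Rr)$, we combine this with the bound (31) to get the result of the proposition. □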

Proof of Proposition 4. We begin by noting that any $g \in \partial F(w(t); (x_{t+\tau}, y_{t+\tau}))$ can be written
as $\alpha x_{t+\tau}$ for some $\alpha \in [-L, L]$. Thus, using the first-order convexity inequality, we see there is such
an $\alpha$ for which

\[
F(w(t); x_{t+\tau}) - F(w(t+\tau); x_{t+\tau}) \le \alpha\langle x_{t+\tau}, w(t) - w(t+\tau) \rangle.
\]

Now we apply Hölder's inequality and Lemma 9, which together yield

\begin{align*}
\langle x_{t+\tau}, w(t) - w(t+\tau) \rangle
&\le \|x_{t+\tau}\|_{A_{t+\tau,\epsilon}^{-1}} \|w(t) - w(t+\tau)\|_{A_{t+\tau,\epsilon}} \\
&\le \frac{3L}{\lambda}\sum_{s=0}^{\tau-1} \|x_{t+\tau}\|_{A_{t+\tau,\epsilon}^{-1}} \|x_{t+s}\|_{A_{t+s,\epsilon}^{-1}} + \frac{2L}{\lambda}\sum_{s=1}^{\tau} \|x_{t+\tau}\|_{A_{t+\tau,\epsilon}^{-1}} \|x_{t+s}\|_{A_{t+s,\epsilon}^{-1}} \\
&\le \frac{3L}{2\lambda}\left[ \sum_{s=0}^{\tau-1} \|x_{t+s}\|_{A_{t+s,\epsilon}^{-1}}^2 + \tau\|x_{t+\tau}\|_{A_{t+\tau,\epsilon}^{-1}}^2 \right] + \frac{L}{\lambda}\left[ \sum_{s=1}^{\tau} \|x_{t+s}\|_{A_{t+s,\epsilon}^{-1}}^2 + \tau\|x_{t+\tau}\|_{A_{t+\tau,\epsilon}^{-1}}^2 \right],
\end{align*}

where we have used the fact that $(a^2 + b^2)/2 \ge ab$ for any $a, b \in \mathbb{R}$. A re-organization of terms and
using the fact that $|\alpha| \le L$ completes the proof. □



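Lemma 9. Let $w(t)$ be generated according to the update (23). Then for any $\tau \in \mathbb{N}$,

\[
\|w(t) - w(t+\tau)\|_{A_{t+\tau,\epsilon}} \le \frac{3L}{\lambda}\sum_{s=0}^{\tau-1} \|x_{t+s}\|_{A_{t+s,\epsilon}^{-1}} + \frac{2L}{\lambda}\sum_{s=1}^{\tau} \|x_{t+s}\|_{A_{t+s,\epsilon}^{-1}}.
\]

Proof. Recall the definition (24) of the outer product matrices $A_t$ and the construction (27) of
the subgradient vectors $g(t)$ from the proof of Proposition 3. With the definition $z(t) = \sum_{i=1}^t g(i)$,
also as in Proposition 3, the update (23) is equivalent to

\[
w(t) = \operatorname*{argmin}_{w \in \mathcal{W}} \left\{ \langle z(t-1), w \rangle + \frac{\lambda}{2}\langle A_{t,\epsilon} w, w \rangle \right\}. \tag{32}
\]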

Now, let us understand the stability of the solutions to the above updates. Fixing $\tau \in \mathbb{N}$, the first
order conditions for optimality in the update (32) for $w(t)$ and $w(t+\tau)$ imply

\[
\langle z(t+\tau-1) + \lambda A_{t+\tau,\epsilon} w(t+\tau), w - w(t+\tau) \rangle \ge 0 \quad \text{and} \quad \langle z(t-1) + \lambda A_{t,\epsilon} w(t), w' - w(t) \rangle \ge 0,
\]

for all $w, w' \in \mathcal{W}$. Taking $w = w(t)$ and $w' = w(t+\tau)$, then adding the two inequalities, we see

\[
\langle z(t+\tau-1) - z(t-1) + \lambda A_{t+\tau,\epsilon} w(t+\tau) - \lambda A_{t,\epsilon} w(t), w(t) - w(t+\tau) \rangle \ge 0. \tag{33}
\]



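The remainder of the proof consists of manipulating the inequality (33) to achieve the desired
result. To begin, we rearrange Eq. (33) to state

\begin{align*}
&\langle z(t+\tau-1) - z(t-1), w(t) - w(t+\tau) \rangle \\
&\quad \ge \lambda\langle A_{t+\tau,\epsilon}(w(t) - w(t+\tau)), w(t) - w(t+\tau) \rangle + \lambda\langle (A_{t,\epsilon} - A_{t+\tau,\epsilon}) w(t), w(t) - w(t+\tau) \rangle \\
&\quad = \lambda\|w(t) - w(t+\tau)\|_{A_{t+\tau,\epsilon}}^2 + \lambda\langle (A_{t,\epsilon} - A_{t+\tau,\epsilon}) w(t), w(t) - w(t+\tau) \rangle.
\end{align*}

Using Hölder's inequality applied to the dual norms $\|\cdot\|_A$ and $\|\cdot\|_{A^{-1}}$, we see that

\begin{align*}
\lambda\|w(t) - w(t+\tau)\|_{A_{t+\tau,\epsilon}}^2
&\le \|z(t+\tau-1) - z(t-1)\|_{A_{t+\tau,\epsilon}^{-1}} \|w(t) - w(t+\tau)\|_{A_{t+\tau,\epsilon}} \\
&\quad + \lambda\|(A_{t+\tau,\epsilon} - A_{t,\epsilon}) w(t)\|_{A_{t+\tau,\epsilon}^{-1}} \|w(t) - w(t+\tau)\|_{A_{t+\tau,\epsilon}},
\end{align*}

and dividing by $\lambda\|w(t) - w(t+\tau)\|_{A_{t+\tau,\epsilon}}$ gives

\[
\|w(t) - w(t+\tau)\|_{A_{t+\tau,\epsilon}} \le \frac{1}{\lambda}\|z(t+\tau-1) - z(t-1)\|_{A_{t+\tau,\epsilon}^{-1}} + \|(A_{t+\tau,\epsilon} - A_{t,\epsilon}) w(t)\|_{A_{t+\tau,\epsilon}^{-1}}. \tag{34}
\]

Now we note the fact that $A_{t+\tau,\epsilon} - A_{t,\epsilon} = \sum_{s=1}^{\tau} x_{t+s} x_{t+s}^\top$, so

\[
\|(A_{t+\tau,\epsilon} - A_{t,\epsilon}) w(t)\|_{A_{t+\tau,\epsilon}^{-1}} \le \max_{s} |\langle x_{t+s}, w(t) \rangle| \sum_{s=1}^{\tau} \|x_{t+s}\|_{A_{t+\tau,\epsilon}^{-1}} \le Rr\sum_{s=1}^{\tau} \|x_{t+s}\|_{A_{t+\tau,\epsilon}^{-1}}.
\]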
s=l s=l 
In addition, we have $z(t+\tau-1) - z(t-1) = \sum_{s=0}^{\tau-1} g(t+s)$, and as in the proof of Proposition 3,

\[
\|z(t+\tau-1) - z(t-1)\|_{A_{t+\tau,\epsilon}^{-1}} \le (L + \lambda Rr)\sum_{s=0}^{\tau-1} \|x_{t+s}\|_{A_{t+\tau,\epsilon}^{-1}} \le 3L\sum_{s=0}^{\tau-1} \|x_{t+s}\|_{A_{t+\tau,\epsilon}^{-1}},
\]

where for the last inequality we used the bound (25), which implies $\lambda Rr \le 2L$. Thus the
inequality (34) yields

\[
\|w(t) - w(t+\tau)\|_{A_{t+\tau,\epsilon}} \le \frac{3L}{\lambda}\sum_{s=0}^{\tau-1} \|x_{t+s}\|_{A_{t+\tau,\epsilon}^{-1}} + \frac{2L}{\lambda}\sum_{s=1}^{\tau} \|x_{t+s}\|_{A_{t+\tau,\epsilon}^{-1}}.
\]

Noting that $A_{t+1,\epsilon} \succeq A_{t,\epsilon}$ completes the proof. □

References 

[1] K. Azuma. Weighted sums of certain dependent random variables. Tohoku Mathematical 
Journal, 68:357-367, 1967. 

[2] P. Bartlett, O. Bousquet, and S. Mendelson. Local rademacher complexities. Annals of Statis- 
tics, 33(4): 1497-1537, 2005. 

[3] A. Beck and M. Teboulle. Mirror descent and nonlinear projected subgradient methods for 
convex optimization. Operations Research Letters, 31:167-175, 2003. 

[4] P. Billingsley. Probability and Measure. Wiley, Second edition, 1986. 

[5] O. Bousquet and A. Elisseeff. Stability and generalization. Journal of Machine Learning 
Research, 2:499-526, 2002. 

[6] R. C. Bradley. Basic properties of strong mixing conditions, a survey and some open questions. 
Probability Surveys, 2:107-144, 2005. 

[7] N. Cesa-Bianchi, A. Conconi, and C. Gentile. On the generalization ability of on-line learning 
algorithms. IEEE Transactions on Information Theory, 50:2050-2057, 2004. 

[8] N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University 
Press, 2006. 

[9] J. Duchi, S. Shalev-Shwartz, Y. Singer, and A. Tewari. Composite objective mirror descent. 
In The 23rd Annual Conference on Computational Learning Theory, 2010. 

[10] J. C. Duchi, A. Agarwal, M. Johansson, and M. I. Jordan. Ergodic subgradient descent. URL
http://arxiv.org/abs/1105.4681, 2011.

[11] D. A. Freedman. On tail probabilities for martingales. The Annals of Probability, 3(1):100-118, 
Feb. 1975. 

[12] E. Hazan, A. Agarwal, and S. Kale. Logarithmic regret algorithms for online convex
optimization. Machine Learning, 69, 2007.



[13] M. Herbster and M. Warmuth. Tracking the best expert. Machine Learning, 32(2):151-178,
1998.

[14] M. Herbster and M. Warmuth. Tracking the best linear predictor. Journal of Machine Learning 
Research, 1:281-309, 2001. 

[15] J. Hiriart-Urruty and C. Lemaréchal. Convex Analysis and Minimization Algorithms I & II.
Springer, 1996.

[16] S. F. Jarner and G. O. Roberts. Polynomial convergence rates of Markov chains. The Annals
of Applied Probability, 12(1):224-247, 2002.

[17] S. M. Kakade and A. Tewari. On the generalization ability of online strongly convex
programming algorithms. In Advances in Neural Information Processing Systems 21, 2009.

[18] H. J. Kushner and G. Yin. Stochastic Approximation and Recursive Algorithms and
Applications. Springer, Second edition, 2003.

[19] R. Meir. Nonparametric time series prediction through adaptive model selection. Machine 
Learning, 39:5-34, 2000. 

[20] S. Meyn and R. L. Tweedie. Markov Chains and Stochastic Stability. Cambridge University 
Press, Second edition, 2009. 

[21] D. Modha and E. Masry. Minimum complexity regression estimation with weakly dependent 
observations. IEEE Transactions on Information Theory, 42(6):2133-2145, 1996. 

[22] M. Mohri and A. Rostamizadeh. Stability bounds for stationary φ-mixing and β-mixing
processes. Journal of Machine Learning Research, 11:789-814, 2010.

[23] Y. Nesterov. Primal-dual subgradient methods for convex problems. Mathematical
Programming A, 120(1):261-283, 2009.

[24] S. Shalev-Shwartz and Y. Singer. Logarithmic regret algorithms for strongly convex repeated 
games. Technical Report 42, The Hebrew University, 2007. 

[25] S. Shalev-Shwartz, Y. Singer, N. Srebro, and A. Cotter. Pegasos: primal estimated sub-gradient
solver for SVM. Mathematical Programming Series B, to appear, 2011.

[26] I. Steinwart and A. Christmann. Fast learning from non-i.i.d. observations. In Advances in 
Neural Information Processing Systems 22, pages 1768-1776, 2009. 

[27] G. Wei and M. A. Tanner. A Monte Carlo implementation of the EM algorithm and the 
poor man's data augmentation algorithms. Journal of the American Statistical Association, 
85(411):699-704, 1990. 

[28] L. Xiao. Dual averaging methods for regularized stochastic learning and online optimization. 
Journal of Machine Learning Research, 11:2543-2596, 2010. 

[29] B. Yu. Rates of convergence for empirical processes of stationary mixing sequences. Annals of 
Probability, 22(1):94-116, 1994. 

[30] B. Zou, L. Li, and Z. Xu. The generalization performance of ERM algorithm with strongly 
mixing observations. Machine Learning, pages 275-295, 2009. 


